Category: C/C++
2009-03-25 20:07:35
Demystify C's greatest difficulty
Level: Intermediate
Cameron Laird, Vice President, Phaseit, Inc.
13 Feb 2007
Updated 04 Apr 2007
Exercise good memory-related coding practices by creating a comprehensive program to keep memory errors under control. Memory errors are the bane of C and C++ programming: they're common, awareness of their importance for over two decades hasn't eradicated them, they can impact applications severely, and few development teams have a definite plan for their management. The good news, though, is that they needn't be so mysterious.
Introduction
Memory errors in C and C++ programs are bad: they're common, and they can have serious consequences. Many of the gravest security notices from the Computer Emergency Response Team (see Resources) and vendors are commentaries on simple memory errors. C programmers have talked about this class of error since the late '70s, but their impact remains large in 2007. Worse, if my impressions are any guide, many of today's C and C++ coders seem to regard memory errors as uncontrollable and mysterious afflictions that one can only recover from, not prevent.
It's not so. This article shows that it's possible to understand all the essentials of good memory-related coding in a single sitting:
Importance of correct memory management
Categories of memory errors
Strategies of memory programming
Conclusion
Importance of correct memory management
C and C++ programs with memory errors cause problems. If they leak memory, they run progressively slower and eventually halt; if they overwrite memory, they are fragile and likely vulnerable to hijacking by a malicious user. Exploits from the famous Morris worm of 1988 to the latest security alerts on Flash Player and other crucial retail-level programs relied on buffer overflows: "The majority of computer security holes are buffer overruns," wrote Rodney Bates in 2004.
Many other general-purpose languages, such as Java™, Ruby, Haskell, C#, Perl, Smalltalk, and so on, are widely enlisted in situations where C or C++ might instead be used, and each has significant enthusiasts and benefits. Part of the folklore of computing, though, is that the majority of the usability advantage each has over C or C++ has to do strictly with ease of memory management. Memory-related programming is so important, and its correct application so difficult in practice, as to dominate all other variables or theories of object-oriented, functional, high-level, declarative, and other qualities of programming languages.
Memory errors also can be insidious in a way common to few other classes of errors: They're hard to reproduce and symptoms often are difficult to localize in the corresponding source code. A memory leak, for example, might render an application entirely unacceptable at the same time it is opaque, regardless of where or when the leak occurs.
For all these reasons, then, memory aspects of C and C++ programming deserve special consideration. Let's see what you can do about them, short of avoiding the languages.
Categories of memory errors
First, don't despair. There are answers to memory challenges. Start with a list of all the possible difficulties:
Memory leak
Misassignment, including multiply free()d memory and uninitialized references
Dangling pointers
Array bounds violations
That's the whole list. Even moving to object-oriented C++ doesn't change the categories significantly; the model of memory management and reference in C and C++ is fundamentally the same, whether the data is the simple types and structs of C, or C++'s classes. Most of what follows is in "pure C," with extension to C++ largely left as an exercise.
Memory leaks
Memory leaks occur when a resource is allocated, but it's never reclaimed. Here's a model for what can go wrong (see Listing 1):
Listing 1. Simple potential heap memory loss and buffer overwrite
void f1(char *explanation)
{
    char *p1;
    p1 = malloc(100);
    (void) sprintf(p1,
                   "The f1 error occurred because of '%s'.",
                   explanation);
    local_log(p1);
}
Do you see the problem? Unless local_log() takes the unusual responsibility for free()ing the memory it's passed, invocation of f1 leaks 100 bytes each time it's called. This is tiny in a time when megabytes are given away in memory sticks as promotional items but, over hours of continuous operation, even such small losses can cripple an application.
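A minimal sketch of one way to plug the leak in Listing 1 follows. It assumes that local_log() only reads the message and does not take ownership of it; the local_log() body, the logged_count counter, the MESSAGE_SIZE name, and the int return value are all illustrative assumptions, not part of the original listing.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define MESSAGE_SIZE 100   /* same size as the buffer in Listing 1 */

static int logged_count = 0;   /* illustrative: counts calls to local_log() */

/* Hypothetical stand-in for local_log(): it only reads the message
 * and does NOT take ownership of it. */
static void local_log(const char *message)
{
    fprintf(stderr, "%s\n", message);
    logged_count++;
}

/* Returns 0 on success, -1 if the allocation fails. */
int f1(const char *explanation)
{
    char *p1;

    assert(explanation != NULL);

    p1 = malloc(MESSAGE_SIZE);
    if (p1 == NULL)
        return -1;
    /* snprintf() truncates instead of writing past MESSAGE_SIZE bytes. */
    (void) snprintf(p1, MESSAGE_SIZE,
                    "The f1 error occurred because of '%s'.",
                    explanation);
    local_log(p1);
    free(p1);   /* the release that Listing 1 omits */
    return 0;
}
```

Every path that follows a successful malloc() ends in exactly one free(), which is the balance worth looking for in any function of this shape.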
In practical C and C++ programming, it is not enough to sanitize your use of malloc() or new. The sentence at the beginning of this section mentioned "resources" rather than just "memory" precisely because of examples like this one (see Listing 2). FILE handles might not look like memory blocks, but they must be handled with the same care:
Listing 2. Potential heap memory loss from resource mismanagement
int getkey(char *filename)
{
    FILE *fp;
    int key;
    fp = fopen(filename, "r");
    fscanf(fp, "%d", &key);
    return key;
}
The semantics of fopen require a complementary fclose. While the C standard doesn't specify what happens without the fclose(), it's likely to leak memory. Other resources, such as semaphores, network handles, database connections, and so on, deserve the same consideration.
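As a sketch of the complementary fclose(), here is one possible rewrite of getkey(). The changed signature (the key is returned through a pointer so that the return value can report failure) is an assumption made for illustration; the original listing does not show any error handling.

```c
#include <assert.h>
#include <stdio.h>

/* Reads one integer from filename into *key.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 * The FILE handle is closed on every path that opened it. */
int getkey(const char *filename, int *key)
{
    FILE *fp;
    int status = -1;

    assert(filename != NULL && key != NULL);

    fp = fopen(filename, "r");
    if (fp == NULL)
        return -1;              /* nothing was opened, nothing to close */
    if (fscanf(fp, "%d", key) == 1)
        status = 0;
    fclose(fp);                 /* the complementary fclose() */
    return status;
}
```

Funnelling both the success and the parse-failure path through a single fclose() keeps the open/close pairing easy to verify by eye.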
Memory misassignments
Less difficult to manage are misassignments. Here's an example (see Listing 3):
Listing 3. An uninitialized pointer
void f2(int datum)
{
    int *p2;
    /* Uh-oh! No one has initialized p2. */
    *p2 = datum;
    ...
}
The good news about errors such as this is that they tend to have dramatic consequences. Under AIX®, assignment to an uninitialized pointer generally results in an immediate segmentation fault. This is good because any such faults are detected swiftly; these errors are much cheaper than ones that take months to identify and are difficult to reproduce.
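For contrast, a pointer is safe to write through only after it refers to real storage. The following sketch gives p2 heap storage before the assignment; returning the pointer to the caller, and the caller's duty to free() it, are assumptions chosen for illustration, since Listing 3 elides what f2() goes on to do.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns a heap-allocated copy of datum, or NULL on failure.
 * The CALLER must free() the result. */
int *f2(int datum)
{
    int *p2 = NULL;             /* never leave a pointer indeterminate */

    p2 = malloc(sizeof *p2);    /* give p2 storage before writing through it */
    if (p2 == NULL)
        return NULL;
    *p2 = datum;
    return p2;
}
```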
There are several variations within this category. Memory can be free()d more often than malloc()ed (see Listing 4):
Listing 4. Two erroneous memory de-allocations
/* Allocate once, free twice. */
void f3()
{
    char *p;
    p = malloc(10);
    ...
    free(p);
    ...
    free(p);
}
/* Allocate zero times, free once. */
void f4()
{
    char *p;
    /* Note that p remains uninitialized here. */
    free(p);
}
These errors also are often not grave. Although the C standard doesn't define behavior in these cases, typical implementations ignore the faults, or flag them swiftly and vividly; as above, these are safe situations.
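One common defensive habit, sketched here under the assumption that a single pointer variable owns the block, is to initialize pointers to NULL and reset them to NULL when freeing. The C standard defines free(NULL) as a no-op, so a repeated release through the same variable becomes harmless. The release() helper is a hypothetical name, not something from the listings above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: frees *pp and resets it to NULL, so a second
 * call through the same variable passes NULL to free(). */
void release(char **pp)
{
    assert(pp != NULL);
    free(*pp);
    *pp = NULL;
}

/* Returns 0 on success, -1 if the allocation fails. */
int f3(void)
{
    char *p = NULL;      /* initialized, unlike p in f4() */

    p = malloc(10);
    if (p == NULL)
        return -1;
    release(&p);
    release(&p);         /* harmless: p is NULL here */
    return p == NULL ? 0 : -1;
}
```

This protects only the variable that is reset; any other copy of the pointer still refers to freed memory.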
Dangling pointers
Dangling pointers are more troublesome. A dangling pointer arises when a programmer uses a memory resource after it has been freed (see Listing 5):
Listing 5. Dangling pointers
int f8()
{
    struct x *xp;
    xp = (struct x *) malloc(sizeof (struct x));
    xp->q = 13;
    ...
    free(xp);
    ...
    /* Problem! There's no guarantee that
       the memory block to which xp points
       hasn't been overwritten. */
    return xp->q;
}
Traditional "debugging" has difficulty isolating dangling pointers. They're poorly reproducible for a couple of distinct reasons:
Even if the code affecting the prematurely-freed memory range is localized, use of the memory might depend on execution elsewhere in the application or, in extreme cases, even in a different process.
Dangling pointers are likely to arise in code that uses memory in subtle ways. The consequence is that, even if memory is overwritten immediately on freeing and the new pointed value differs from the expected one, the new value might be hard to recognize as erroneous.
Dangling pointers are a constant threat to the health of C or C++ programs.
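A sketch of one way to avoid the dangling read in Listing 5: copy out whatever is still needed before the block is freed, and never touch the pointer afterwards. The definition of struct x and the -1 error return are assumptions, because the article does not show them.

```c
#include <assert.h>
#include <stdlib.h>

struct x {
    int q;   /* assumed member; the article does not define struct x */
};

/* Returns the stored value, or -1 if the allocation fails
 * (an assumed error convention). */
int f8(void)
{
    struct x *xp;
    int result;

    xp = malloc(sizeof *xp);
    if (xp == NULL)
        return -1;
    xp->q = 13;
    result = xp->q;   /* copy out while the block is still valid */
    free(xp);
    xp = NULL;        /* no stale address is left behind in xp */
    return result;
}
```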
Array bounds violations
Not safe at all are array bounds violations, the final major category of memory mismanagement. Look back at Listing 1; what happens if the length of explanation exceeds 63 characters? (The fixed text of the format string takes 36 characters, and the terminating null takes one more.) Answer: It's hard to predict, but it's probably far from good. More specifically, sprintf() writes a string that doesn't fit into the 100 characters allocated for it. In any common implementation, the "excess" characters overwrite other data in memory. The layout of data allocations in memory is complex and hard to reproduce, so any symptoms might be hard to connect back to the specific error at the level of source code. These are among the errors that regularly result in millions of dollars of damage.
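One way to rule out the overflow in Listing 1, sketched here with C99's snprintf(), is to measure the formatted message first and allocate exactly what it needs. The function name format_f1_message() and the convention that the caller frees the result are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define F1_FORMAT "The f1 error occurred because of '%s'."

/* Builds the Listing 1 message in a freshly allocated buffer of exactly
 * the right size. Returns NULL on failure.
 * The CALLER must free() the result. */
char *format_f1_message(const char *explanation)
{
    char *message;
    int needed;

    assert(explanation != NULL);

    /* With a zero size, snprintf() writes nothing and returns the
     * number of characters the full output would need. */
    needed = snprintf(NULL, 0, F1_FORMAT, explanation);
    if (needed < 0)
        return NULL;
    message = malloc((size_t) needed + 1);   /* +1 for the terminator */
    if (message == NULL)
        return NULL;
    (void) snprintf(message, (size_t) needed + 1, F1_FORMAT, explanation);
    return message;
}
```

The buffer size now follows from the data rather than from a guess, so no length of explanation can push the write past the end of the allocation.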
Strategies of memory programming
Diligence and discipline can reduce the incidence of these errors to near zero. Let's go over several specific steps you can take; my experience with these in a variety of organizations is that they consistently slash memory errors by at least an order of magnitude.
Coding style
The most important step, and the one I have never seen emphasized by any other author, is a coding standard. Functions and methods that impact resources, especially memory, need to explain themselves explicitly. Here are examples of pertinent headers, comments, or names (see Listing 6).
Listing 6. Examples of resource-aware source code
/********
 * ...
 *
 * Note that any function invoking protected_file_read()
 * assumes responsibility eventually to fclose() its
 * return value, UNLESS that value is NULL.
 *
 ********/
FILE *protected_file_read(char *filename)
{
    FILE *fp;
    fp = fopen(filename, "r");
    if (fp) {
        ...
    } else {
        ...
    }
    return fp;
}
/********
 * ...
 *
 * Note that the return value of get_message points to a
 * fixed memory location. Do NOT free() it; remember to
 * make a copy if it must be retained ...
 *
 ********/
char *get_message()
{
    static char this_buffer[400];
    ...
    (void) sprintf(this_buffer, ...);
    return this_buffer;
}
/********
 * ...
 * While this function uses heap memory, and so
 * temporarily might expand the over-all memory
 * footprint, it properly cleans up after itself.
 *
 ********/
int f6(char *item1)
{
    my_class *c1;
    int result;
    ...
    c1 = new my_class(item1);
    ...
    result = c1->x;
    delete c1;
    return result;
}
/********
 * ...
 * Note that f8() is documented to return a value
 * which needs to be returned to heap; as f7 thinly
 * wraps f8, any code which invokes f7() must be
 * careful to free() the return value.
 *
 ********/
int *f7()
{
    int *p;
    p = f8(...);
    ...
    return p;
}
Make these stylistic elements part of your routine. There are all sorts of approaches to memory issues:
Special-purpose libraries
Languages
Software tools
Hardware checkers
Over this entire domain, the one step I've most consistently found useful and with the biggest return on its investment is thoughtful improvement of source code style. It needn't be expensive or rigidly formal; memory-neutral segments can be left uncommented as always, while memory-impacting definitions surely deserve explicit comment. Put in a few simple words to make memory consequences clear, and your memory programming improves.
I haven't done controlled experiments to validate the effects of this style. If your experience is anything like mine, you'll find you don't want to live without a policy of commenting resource impact. To do so simply pays off too well.
Inspection
Supplementary to coding standards is inspection. Either helps on its own, but they're particularly potent in partnership. An alert C or C++ practitioner can scan even unfamiliar source code and detect memory problems at very low cost. With a little practice and appropriate textual searches, you can quickly develop an ability to validate source corpora for balanced *alloc() and free(), or new and delete. Human source review of this sort often turns up problems like the one in Listing 7.
Listing 7. A troublesome memory leak
static char *important_pointer = NULL;
void f9()
{
    if (!important_pointer)
        important_pointer = malloc(IMPORTANT_SIZE);
    ...
    if (condition)
        /* Ooops! We just lost the reference
           important_pointer already held. */
        important_pointer = malloc(DIFFERENT_SIZE);
    ...
}
Superficial use of automatic run-time tools doesn't detect the memory leak that occurs if condition is true. Careful source analysis can reason through such conditionals to provably correct conclusions. I repeat what I wrote about style: While most published descriptions of memory problems emphasize tools and languages, for me, the greatest gains come from "soft," developer-centered process changes. Any improvements you make in style and inspection help you understand the diagnostics produced by automatic tools.
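Once inspection has found the lost reference in Listing 7, one possible repair is to release the old block before overwriting the only pointer to it. In this sketch the two size constants, the condition parameter, and the int return value are assumptions; the article leaves all of them unspecified.

```c
#include <assert.h>
#include <stdlib.h>

#define IMPORTANT_SIZE 64    /* assumed values; the article leaves */
#define DIFFERENT_SIZE 256   /* both constants undefined           */

static char *important_pointer = NULL;

/* Returns 0 on success, -1 if an allocation fails. */
int f9(int condition)
{
    if (!important_pointer) {
        important_pointer = malloc(IMPORTANT_SIZE);
        if (!important_pointer)
            return -1;
    }
    if (condition) {
        /* Release the old block BEFORE overwriting the only
         * reference to it. */
        free(important_pointer);
        important_pointer = malloc(DIFFERENT_SIZE);
        if (!important_pointer)
            return -1;
    }
    return 0;
}
```

This version discards the old contents. If they had to survive the resize, realloc() would be the usual alternative, with its result assigned to a temporary first so that a failed call does not lose the original block.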
Static automatic syntax analysis
Humans aren't the only ones who can read source code, of course. You should also make static syntax analysis part of your development process. Static syntax analysis is what lint, strict compilation, and several commercial products do: Scan a source text and spot items that a compiler accepts, but that are likely to be symptoms of mistakes.
Expect to make your code lint-free. While lint is old and limited, the many programmers who don't bother with it (or its more advanced descendants) make a big mistake. It is possible, in general, to write good, professional-quality code that passes lint, and the effort to do so usually turns up significant errors. Some of these affect memory correctness. Even payment of the most expensive license fees among the products available in this category loses its sting when compared to the costs of having a customer be the first to identify a memory error. Clean your source code. Even if a construct that lint flags appears to give you the functionality you want now, it's very, very likely that a cleaner approach exists, one that satisfies lint and is more robust and portable.
Memory libraries
The final two categories of remedy are distinct from the first three. The former are light-weight; an individual can readily understand and implement them. Memory libraries and tools, on the other hand, have generally higher license fees, and they require more sophistication and judgment on the part of the developer. The programmers who use libraries and tools effectively are those who understand the light-weight, static approaches. The available libraries and tools are impressive: Their quality, as a group, is quite high. Even the best ones can be foiled, though, by a sufficiently willful programmer committed to ignoring basic principles of memory management. From what I've seen, mediocre programmers working in isolation only frustrate themselves when they try to take advantage of memory libraries and tools.
For all these reasons, I urge C and C++ programmers to start by looking at their own source for memory problems. Having done that, it's time to consider libraries.
Several libraries make it possible to write conventional-looking C or C++ code, with the assurance of improved memory management. Jonathan Bartlett described leading candidates in a 2004 review for developerWorks, available through the Resources section below. Libraries address so many different memory issues that it's difficult to compare them directly; common rubrics in the domain include garbage collection, smart pointers, and smart containers. In broad terms, the libraries automate more of memory management so that the programmer makes fewer errors.
I have mixed feelings about memory libraries. They should work, but their success in the projects I've seen has been less than expected, especially on the C side. I don't yet have a good analysis for these disappointing outcomes. Performance, for example, ought to be as good as comparable manual memory management, but this is a gray area -- especially in situations where garbage-collecting libraries seem to slow processing. My most definite conclusion from working in this area is that the C++ culture seems to accept smart pointers better than groups of C-focused coders.
Memory tools
Development teams putting out serious C-based applications need a run-time memory tool as part of their development strategy. The techniques already described are valuable and necessary, but the quality and functionality of the memory tools available can be hard to appreciate until you've tried them for yourself.
This introduction only focuses on software-based memory tools. Hardware memory debuggers also exist; I regard them as needed only for very special situations -- mostly when working with specialized hosts that don't support other tools.
The marketplace of software memory tools includes both proprietary ones, like IBM Rational® Purify, and open source ones, like Electric Fence. Several of each work well with AIX, among other operating systems.
All the memory tools operate roughly the same: Build a special version of your executable (much as you might generate a debugging version by using the -g flag when compiling), exercise the application, and study reports automatically generated by the tool. Consider a program like that of Listing 8.
Listing 8. Sample error
int main()
{
    char p[5];
    strcpy(p, "Hello, world.");
    puts(p);
}
In many environments, this program "works": it compiles, executes, and prints "Hello, world.\n" to the screen. Running the same application with a memory tool results in a report of an array-bounds violation on the fourth line. Learning about a software fault this way (that fourteen characters have been copied into a space guaranteed to hold only five) is considerably less expensive than finding out from a customer about a symptom of failure. That's the contribution of memory tools.
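After a tool has pointed at the strcpy() call, one possible repair is a copy that is told how large the destination is. The bounded_copy() helper below is a hypothetical name used only for this sketch; it relies on snprintf(), which never writes more than the size it is given and always terminates the result when that size is nonzero.

```c
#include <assert.h>
#include <stdio.h>

/* Copies src into dst, never writing more than dst_size bytes
 * (terminator included). Returns 1 if src fit completely,
 * 0 if it was truncated. */
int bounded_copy(char *dst, size_t dst_size, const char *src)
{
    int written;

    assert(dst != NULL && src != NULL && dst_size > 0);

    written = snprintf(dst, dst_size, "%s", src);
    return written >= 0 && (size_t) written < dst_size;
}
```

Called as bounded_copy(p, sizeof p, "Hello, world.") with the five-byte p of Listing 8, it stores "Hell" plus the terminator and returns 0, so the caller learns about the truncation instead of corrupting neighboring memory.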
Conclusion
As a mature C or C++ programmer, you recognize that memory problems deserve serious attention. With a little planning and practice, you can come up with an approach that brings memory hazards under control. Learn correct patterns for memory use, be sensitive to the errors likely to occur, and make the techniques described in this article part of your daily routine. You can begin to eliminate symptoms from your applications that otherwise might take days or weeks to debug.
Cameron Laird is a long-time developerWorks contributor and former columnist. He often writes about the open source projects that accelerate development of his employer's applications, focused on reliability and security. He first used AIX twenty years ago, when it was still an experimental product. He's been an enthusiastic consumer of and contributor to a variety of memory debugging tools through that time.