分类: C/C++
2010-05-24 08:55:08
Abstract: A substantial portion of a knowledge worker's life may be spent waiting for a computer program to produce output. Users and organizations control their wait time by purchasing faster computers, adding memory, or using faster network connections. Developers of application programs have a responsibility to design their programs make the best use of these limited and expensive resources.
This document describes techniques for optimizing (improving the speed of) computer programs written in C. It focuses on minimizing time spent by the CPU and gives sample source code transformations that often yield improvements. Memory and I/O speed improvements are also discussed.
.
The single most effective optimization technique is to use a profiler to identify performance bottlenecks. It's often difficult to guess what part of your program is consuming the most resources, and if you base your optimization efforts on speculation instead of real data, you'll waste a lot of time speeding up the parts of your program that were fast already.
Once you have identified a bottleneck, for example a loop that is executed thousands of times, remember that the best thing to do is to redesign the program so that it doesn't need to execute the loop thousands of times. This is more effective than making the loop run 10% faster, yet just as often, which is precisely the sort of thing an optimizing compiler can do by itself. Optimization is simply waste of programmer time if any of these statements are true:
Also take into account how the program will be used. If it's a report-generator program which only needs to be run once a day, the user may start it before going off to lunch and if so there's really no point in making it finish before they get back. If it's being invoked from within another program that's even slower than yours, again the user will notice no difference. But if it handles mouse-tracking events for a GUI the user will complain about any noticeable delay.
Given that optimizing is reasonable, compile in full optimize mode and run your program on "real-world" input data. If you don't have access to real input data, choose your test input data with care: programmers tend to test with a smaller set of input data and a different variety of cases than the the program will likely encounter once it's in the hands of users.
Use the time(1) command to see if your program is compute-bound, memory bound, or I/O bound, or for that matter if it's bound at all: even if your program "seems" to be taking a long time to run, it may only be a second of CPU time. You may have more "time" commands on your system than you realize: sometimes it's built into the shell (SunOS csh for example) and has lots of nifty options. You can also get performance information from getrusage() if you have it, and of course from profiling programs like gprof, prof, and tcov.
Recompile your program with profiling enabled and whatever optimization options are compatible with that. Run your program, again on real world data, and generate a profiling report. Figure out which function uses the most CPU time, then look it over very carefully and see if any of these approaches might be useful. Make one change at a time, then run the profiler again, repeating the process until there is no obvious bottleneck or the program runs sufficiently fast.
The first four are quite important; the rest are in no particular order.
Think about what the code is really doing. Become familiar with the body of literature that describes your specialty and learn and use the most appropriate algorithms, even if you didn't come up with them yourself. You should be familiar with O(n) notation, which is defined in many computer science texts.
Some of the obvious replacements:
slow algorithm replace with sequential search binary search or hash lookup insertion or bubble sort quick sort, merge sort, radix sort
Also choose an appropriate data structure. If you'll be doing a lot of insertions and deletions at random places then a linked list would be good. If you'll be doing some binary searching, an array would be better.
Some of the very things that make code clear and readable to humans also make it clear and readable to compilers. Complicated expressions are harder to optimize and can cause the compiler to "fallback" to a less intense mode of optimization. I've seen one compiler that has translation-unit limits which when overrun will cause the entire module to be de-optimized by one level.
Part of the clarity is making hunks of code into functions when appropriate. The cost of a function call is extremely small on modern machines, so optimization is NOT a valid excuse for writing ten-page functions.
If you write clean, portable code you can quickly port to the latest, fastest machine and offer that as a solution to your customers who are interested in speed.
Get a sense for how long certain operations take. Among the the slowest are opening a file, reading or writing significant amounts of data, starting a new process, searching, sorting, operations on entire arrays, and copying large amounts of data around. The fastest operations are basic elements of the language like assigning to a variable, dereferencing a pointer, or adding two integers. None of these operations take very long in and of themselves (a few microseconds) but all computer languages allow sections of code to be executed repeatedly. If you perform even the fastest operation 10 million times, it will take a noticeable amount of time. In real programs, various things happen various numbers of times and having some common sense can help you interpret the output from your profiler.
A sure sign of
misunderstanding is this fragment: if (x != 0) x = 0;
For the intrepid hacker, there is no substitute for examining the assembler-level output the compiler generates and counting the instructions. If you do this, don't forget to include wait states in your tally. Also keep in mind that some optimization and instruction scheduling is put off until link time and won't show up in the assembler output for an individual module.
I have a small,
fairly portable program that prints out the approximate unoptimized
speed of basic C operators and library functions, . It's not a benchmark (the measurements are not
weighted or averaged in anyway) but it'll give a beginner a numerical
sense to the cost of various things in C.
Note: (a) I have no
anonymous ftp server at my site, so you'll have to look at it in your
browser and save it to a file or copy and paste it into an editor. (b)
Be sure to compile it with optimization off. Most of the code
is trivial and would be eliminated by an optimizer; the timing would end
up being mostly zeroes.
Most compilers have several levels of optimization. Make sure you're using the highest level. One popular compiler, gcc, has an incredible variety of optimization levels and options. Some compilers have special #pragmas or keywords (for example, inline) which also affect optimization.
High levels of optimization can make poorly written (but compilable) code break. Less often, optimizers mangle perfectly correct code. Problems with side effects and evaluation order tend to screw things up, so you may have to fix the code up a little before you can use the highest level. Signal handlers activated at inopportune times may disturb an optimized program in ways that wouldn't disturb an unoptimized one.
gcc (with the -finline-functions option) and a few other compilers are capable of inlining small functions at higher optimization levels automatically. K&R C compilers that I've seen won't do it at all or will only do it for library functions written in assembly (for example the math library). C++ compilers almost universally support inline functions.
If necessary, C functions can be recoded as macros to obtain similar speedup on compilers with no inlining capability. This should be done after the code is completely debugged. All debuggers I've seen are incapable of displaying or stepping through the expanded macro text.
An example of inlining via macros:
Old code: int foo(int a, int b) | New code: #define foo(a, b) (((a)-(b)) * ((b)+1)) |
The extra parentheses are necessary to preserve grouping in case foo is used in an expression with higher precedence than * or in case a and/or b contain subexpressions with lower precedence than + or -.
Comma expressions and do { ... } while(0) can be used for more complicated functions, with some restrictions. The do-while macros let you create local variables for use in the macro, but you can't return a value to the expression the macro is used in. The opposite is true for macros using comma expressions.
Some caveats:
Many compilers (e.g. gcc -funroll-loops) will do this, but if you know that yours doesn't you can change your source code a bit to get the same effect.
Old code: for (i = 0; i < 100; i++) | New code: for (i = 0; i < 100; ) |
This way the test for i < 100 and the branch back up to the top of the loop only get executed 11 times rather than 101. Loop unrolling works best when the loop is executed a fixed non-prime number of times and the iteration variable is only modified in one place (aside from its initialization). If do_stuff() didn't make use of i, all the littlei++'s could be replaced by a single i += 10. Re-arranging the for loop into a do-while loop can make the 11 into 10. If the loop only went to, say, five rather than 100 you could unroll the loop completely and eliminate the branching and testing entirely.
For computations which converge on some result, compilers will often refuse to unroll on the grounds that unrolling will change the result. If the application is not sensitive to excess precision you can arrange to test the convergence less often by duplicating the interior of the loop. This is especially useful if the test for convergence is expensive in relation to the calculation of the result.
An unrolled loop is larger than the "rolled" version and so may no longer fit into the instruction cache (on machines which have them). This will make the unrolled version slower. Also, in this example, the call to do_stuff()overshadows the cost of the loop, so any savings from loop unrolling are insignificant in comparison to what you'd achieve from inlining in this case.
If you happen to be using a vectorizing compiler, loop unrolling can interfere with the vector optimizations.
The idea is to combine adjacent loops which loop over the same range of the same variable. Assuming nothing in the second loop indexes forward (for example array[i+3]), you can do this:
Old code: /* initialize 2d array | New code: for (i = 0; i < MAX; i++) |
And now the incrementing and testing of i is done only half as often. Under some circumstances, locality of reference may be better, improving cache behavior. (The example above was lifted from Aho & Ullman.)
Some machines have a special instruction for decrement and compare with 0. Assuming the loop is insensitive to direction, try this replacment:
Old code: for (i = 1; i <= MAX; i++) | New code: i = MAX+1; |
Though watch out for problems with lookahead caches as noted below in MARCH FORWARD. If you plan on doing this in combination with pointer arithmetic note that while ANSI C has a special rule that allows you to set an index to one element past the end of the array, there is no similar rule for one element before the beginning of the array.
Strength reduction is the replacement of an expression by a different expression that yields the same value but is cheaper to compute. Many compilers will do this for you automatically. The classic examples:
Old code: x = w % 8; | New code: x = w & 7; /* bit-and cheaper than remainder */ |
Also note that array indexing in C is basically a multiply and an add. The multiply part can be subjected to strength reduction under some circumstances, notably when looping through an array.
Any part of a computation that does not depend on the loop variable and which is not subject to side effects can be moved out of the loop entirely. Most compilers are pretty good at doing this. Try to keep the computations within the loop simple anyway, and be prepared to move invariant computations out yourself: there may be some situations where you know the value won't vary, but the compiler is playing it safe in case of side-effects. "Computation" here doesn't mean just arithmetic; array indexing, pointer dereferencing, and calls to pure functions are all possible candidates for moving out of the loop.
In loops which call other functions, you might be able to get some speedup by ripping the subroutines apart and figuring out which parts of them are loop-invariant for that particular loop in their caller and calling those parts ahead of time. This is not very easy and seldom leads to much improvement unless you're calling subroutines which open and close files repeatedly or malloc and free large amounts of memory on each call or something else drastic.
A common but not-always-optimized-away case is the repeated use of an expression in successive statements. This doesn't have to be related to a loop at all. You can try this replacement and see if it helps:
Old code:
total =
a->b->c[4]->aardvark +
a->b->c[4]->baboon +
a->b->c[4]->cheetah +
a->b->c[4]->dog;
New code:
struct animals * temp = a->b->c[4];
total =
temp->aardvark +
temp->baboon +
temp->cheetah +
temp->dog;
Older C compilers were allowed to regroup arithmetic expressions to some extent. ANSI codified that arithmetic expressions are evaluated as they are grouped so as to avoid any unwanted surprises. What this means is that
float a, b, c, d, f, g;
...
a = b / c * d;
f = b * g / c;
will not been seen as having the common subexpression b / c. If you rewrite the second expression:
float a, b, c, d, f, g;
...
a = b / c * d;
f = b / c * g;
an ANSI C compiler is now free to compute b / c only once. Note that the new code may behave differently for overflow or provide slightly different results with floating point.
In a section of code which deals with several alternative situations, place at the beginning the tests and the code for the situations which occur most often. Frequently, this takes the form of a long train of mutually exclusive if-then-else's, of which only one will get executed. By placing the most likely one first, fewer if's will need to be performed over the long term.
But if the conditions are simple things like x == 3, consider using a switch statement. Some compilers are quite sophisticated in how they translate switch statements.
First, let me define tail recursion elimination (TRE). When a recursive function calls itself, an optimizer can, under some conditions, replace the call with an assembly level equivalent of a "goto" back to the top of the function. The saves the effort of growing the stack, saving and restoring registers, and any other function call overhead. For very small recursive functions that make zillions of recursive calls, TRE can result in a substantial speedup. With proper design, the TRE can take a recursive function and turn it into whatever is the fastest form of loop for the machine.
Tail recursion elimination (TRE for short) has been around for a long time. It originated with "functional" languages like LISP which do so much recursion that TRE is a necessity. C, C++, and the "pascal-like" languages fall into the "imperative" language category and extremely efficient programs, recursive or not, can be written and compiled without the benefit of TRE and still perform on par with (and often better than) a similar LISP program. What I'm leading up to is that TRE is not automatically present in every modern optimizer, though many certainly do have it.
Back to the conditions I mentioned earlier. In order for TRE to be a safe optimization the function must return the value of the recursive call without any further computation. Example:
int isemptystr(char * str)
{
if (*str == '\0') return 1;
else if (! isspace(*str)) return 0;
else return isemptystr(++str);
}
The above can have TRE applied to the final return statement because the returned value from this invocation of isemptystr will be exactly that of the n+1th invocation, with no further computation.
And now a counterexample:
int factorial(int num)
{
if (num == 0) return 1;
else return num * factorial(num - 1);
}
The above cannot have TRE applied because the returned value is not used directly: it is multiplied by num after the call, so the state of that invocation must be maintained until after the return. Even a compiler that supports TRE cannot use it here.
And now a counter-counterexample, a rewrite of the factorial program to allow TRE optimization.
int factorial(int num, int factor)
{
if (num == 0) return factor;
else return factorial(num - 1, factor * num);
}
In my experience, the optimizers for imperative languages don't bother to perform this sort of rewriting. I have seen it done with Scheme compilers, so it is possible. Programmers who write code this way (other than for example purposes!) should be beaten about the head and shoulders with a blunt dandruff-removing object.
Even if your compiler implements TRE optimization, you should not assume that it automatically has done so for you. You may have to rewrite even the simplest recursive function before TRE can be applied. If doing so reduces the readability of the function, the effort is highly questionable.
There is a large subset of recursive algorithms that have simple iterative counterparts. C compilers are very, very good at optimizing loops and can do so without the sometimes-onerous conditions that TRE requires, so if you must optimize at the source level, consider using plain iteration before TRE-friendly rewriting.
Finally, if the recursive function contains loops or large amounts of code, TRE will be only marginally helpful, since TRE only optimizes the recursion, not anything about the algorithm itself. Function calls are very fast, and optimizing out something that is already very fast will quickly run into the law of diminishing returns.
Consider using lookup tables especially if a computation is iterative or recursive, e.g. convergent series or factorial. (Calculations that take constant time can often be recomputed faster than they can be retrieved from memory and so do not always benefit from table lookup.)
Old code:
long factorial(int i)
{
if (i == 0)
return 1;
else
return i * factorial(i - 1);
}
New code:
static long factorial_table[] =
{1, 1, 2, 6, 24, 120, 720 /* etc */};
long factorial(int i)
{
return factorial_table[i];
}
If the table is too large to type, you can have some initialization code compute all the values on startup or have a second program generate the table and printf it out in a form suitable for #include-ing into a source file. At some point, the size of the table will have a noticeable affect on paging space. You could have the table cover the first N cases then augment this with a function to compute the rest. As long as a significant number of the values requested are in the table there will be a net win.
For nearly all situations, the library qsort function is speedy enough to make implementation of your own sort algorithm unnecessary. Focus your optimization efforts on speeding up the comparison function. Often the strcmpoptimizations discussed later in this document are helpful.
Consider massaging the input data ahead of time in such a way as to make the comparison routine run faster:
If you are sorting array of structs or other large objects, sort an array of pointers to them instead so the sort function won't waste time copying whole structs around during the sort.
Evaluate ratio of insertions to lookups.
If you are writing the sort function yourself, write a specific sort for a specific type of array and a specific comparison function. Incorporate everything into your sort function, essentially inlining the comparison function and swapping code.
For custom sorts patterned after qsort you can make special copies of the sort which assume that data being swapped is a convenient machine size, like short, long, or double. When the width of the object being sorted matches the size of a basic type (and the alignment is suitable) you can do a direct swap instead of calling memcpy.
Make sure assert or other debugging statements are commented out or #defined away.
Avoid referring to global or static variables inside the tightest loops. Don't use the volatile qualifier unless you really mean it. Most compilers take it to mean roughly the opposite of register, and will deliberately not optimize expressions involving the volatile variable.
Avoid passing addresses of your variables to other functions. The optimizer has to assume that the called function is capable of stashing a pointer to this variable somewhere and so the variable could get modified as a side effect of calling what seems like a totally unrelated function. At less intense levels of optimization, the optimizer may even assume that a signal handler could modify the variable at any time. These cases interfere with placing variables in registers, which is very important to opimization. Example:
a = b();
c(&d);
Because d has had its address passed to another function, the compiler can no longer leave it in a register across function calls. It can however leave the variable a in a register. The register keyword can be used to track down problems like this; if d had been declared register the compiler would have to warn that its address had been taken.
While functions and modularity are a good thing, a function call inside an oft-executed loop is a possible bottleneck. There are several facets to the problem, some of which I've alluded to above. Aside from the expense of executing the instructions in the other function:
Straight-line code, even with an extra statement or two, will run faster than code full of if's, &&'s, switch's, andgoto's. Pipelining processors are much happier with a steady diet of sequential instructions than a bunch of branches, even if the branches skip some unnecessary sections.
Most of the C library str* and mem* functions operate in time proportional to the length(s) of the string(s) they are given. It's quite easy to loop over calls to these and wind up with a significant bottleneck. Some things that may ease the situation:
#define
QUICKIE_STRCMP(a, b) (*(a) != *(b) ? \
(int) ((unsigned char) *(a) - \
(unsigned char) *(b)) : \
strcmp((a), (b)))
But watch out:
An entirely different way to speed up strcmp is to place all your strings into a single array, in order. Then you only have to compare the pointers, not the strings. If the point of all the calls to strcmp is to search for a value from a large, known set and you expect that you'll be doing many such searches then you'll want to invest in a hash table.
In some really weird cases (early RISC chips with FPUs, which includes many SPARC-based machines) you can get some speed by turning your integer multiplies, divides, and modulos into float operations. This does two things: the FPU is taking some of the load off the main CPU so it can get some other stuff done; but mostly some RISC machines emulate integer multiply and divide in software and the FPU can actually do it faster anyway. This is not a sure thing and you should try it both ways on all the machines you plan to run on before you do this. Also, converting back-and-forth between ints and floats can take up considerable time. I have heard this is an acute problem on Apollos.
On many machines, floats work faster than doubles. If the bottleneck involves FP arithmetic and you don't need the extra precision, change the pertinent variables to float and see what happens. But similar to above, the cost of converting between floats and doubles can outweigh the benefits if applied indiscriminately. Note that on K&R compilers and ANSI compilers used w/o prototypes, there is an automatic conversion of floats to doubles in function calls and in many expressions.
GCC produces somewhat faster code than the compilers that come with some OS's. Try compiling your code with all the compilers you have at your disposal and use the fastest one for compiling the one or two functions which are bottlenecks. For the rest just use the one that gives the most informative error messages and produces the most reliable output. This is more difficult with C++ since different compilers can use incompatible name-mangling schemes.
Compilers are evolving even as we speak, so keep up with the latest revision if possible.
A typical cause of stack-related problems is having large arrays as local variables. In that case the solution is to rewrite the code so it can use a static or global array, or perhaps allocate it from the heap. A similar solution applies to functions which have large structs as locals or parameters. (On machines with small stacks, a stack bound program may not just run slow, it might stop entirely when it runs out of stack space.)
Recursive functions, even ones which have few and small local variables and parameters, can still affect performance. On some ancient machines there is a substantial amount of overhead associated with function calls, and turning a recursive algorithm into an iterative one can save some time.
A related issue is
last-call optimization. Many LISP and Scheme compilers do this
automatically, but few C compilers support it. Consider this example: int func1()
{
int a, b, c, etc;
do_stuff(a, b, c)
if (some_condition)
return func2();
else
return 1;
}
Since func1()'s last statement is to call func2() and func1() has no need of its variables after that point, it can remove its stack frame. When func2() is done it returns directly to func1()'s caller. This (a) reduces the maximum depth of the stack and (b) saves the execution of some return-from-subroutine code as it will get executed only once instead of twice (or more, depending on how deeply the function results are passed along.)
If func1() and func2() are the same function (recursion) the compiler can do something else: the stack can be left in place and the call can be replaced by a goto back to the top of the function. This is called tail-recursion elimination.
Estimates vary widely, but a competent human writing assembly-level code can produce code which runs about 10% faster than what a compiler with full optimization on would produce from well-written high-level source. Of course, this is not a portable solution. RISC assembler is especially difficult to write by hand.
Experts vary on the approach one should take. Some suggest writing your own assembler version of a function from scratch in hopes that you'll find a novel way of calculating the result while others recommend taking the compiler's version as a starting point and just fine-tuning it.
Dynamically linked shared libraries are a really nifty thing. In many cases though, calling a dynamically linked function is slightly slower than it would be to call it statically. The principal part of this extra cost is a one-time thing: the first time a dynamically called function is called there is a bit of searching going on, but thereafter the overhead should be a very tiny number of instructions.
For applications with thousands of functions, there can be a noticeable lag at startup. Linking statically will reduce this, but defeats (to some extent) the benefits of code sharing that dynamic libraries can bring. Often, you can selectively link some libraries as dynamic and others as static. For example, the X11, C and math libraries you'd link dynamically (since other processes will be using these also and the program can use to the copy already in memory) but still link your own application-specific libraries statically.
As with other machine-specific code, you can use #ifdef to set off sections of code which are optimized for a particular machine. Compilers don't predefine RISC or SLOW_DISK_IO or HAS_VM or VECTORIZING so you'll have to come up with your own and encode them into a makefile or header file.
The foremost consideration when optimizing memory access, whether for virtual memory or cache considerations, is locality of reference. That is the property of a program to use addresses which are near (in both time and location) other recent references to memory. The main difference between optimizing for VM and optimizing for cache is scale: VM pages can be anywhere from 0.5kb to 8kb and beyond and can take tens of milliseconds to read in from disk. Cache blocks typically range from 16 bytes to 256 bytes and get read in in the tens of microseconds. A program which forces many VM pages or cache lines to load in quick succession is said to be "thrashing."
You can affect locality of reference by changing the order in which you access and allocate things and by splitting your data structures into "frequently used" and "infrequently used" segments and allocating all the "frequently used" stuff together. (But don't fool yourself into thinking that malloc always allocates adjacent chunks of memory on each successive call. You have to allocate one giant chunk and dole it out yourself to be certain. But this can lead to other problems.)
Search and sort algorithms differ widely in their patterns of accessing memory. Merge sort is often considered to have the best locality of reference. Search algorithms may take into consideration that the last few steps of the search are likely to take place in the same page of memory, and select a different algorithm at that point.
When stepping through multi-dimensional arrays,
be sure to increment the rightmost index first. This natural for C
programmers, but backwards for FORTRAN. If you find yourself writing C
code like this, you probably need to be re-educated: float
array[20][100];
int i, j;
for (j = 0; j < 100; j++)
for (i = 0; i < 20; i++)
array[i][j] = 0.0;
Instead of copying strings, arrays, or large structs, consider copying a pointer to them. As long as you're done using the pointer before you modify the string, array, or struct you're okay.
ANSI C now requires that structs are pass-by-value like everything else, thus if you have extraordinarily large structs, or are making millions of function calls on medium-sized ones, you might consider passing the struct's address instead, after modifying the called function so that it doesn't perturb the contents of the struct.
If the parts of your program making heaviest use of memory are doing so by accessing elements in "parallel" arrays you can combine them into an array of structs so that the data for a given index is kept together in memory.
If you already have an array of structs, but find that the critical part of your program is accessing only a small number of fields in each struct, you can split these fields into a separate array so that the unused fields do not get read into the cache unnecessarily.
On machines with alignment restrictions you can sometimes save a tiny amount of space by arranging similarly-typed fields together in a structure with the most restrictively aligned types first. There may still be padding at the end, but eliminating that can destroy the performance gains the alignment was meant to provide.
Old code: /* sizeof = 64 bytes */ | New code: /* sizeof = 48 bytes */ |
ANSI C makes few guarantees about internal struct padding, but an arrangement like that usually results in minimal losses even on the pickiest of RISC machines.
A typical use of char or short variables is to hold a flag or mode bit. You can combine several of these flags into one byte using bit-fields at the cost of data portability. You might instead use bitmasks and the &operator to extract the values; this is more portable but also more tedious.
Paradoxically, increasing the size and alignment of a data structure to match (or to be an integer fraction or multiple of) the cache line size may increase performance. The rationale is that if the data structure is an odd size relative to the cache line size, it may overlap two cache lines thus doubling the time needed to read it in from main memory.
The size of a data structure can be increased by adding a dummy field onto the end, usually a character array. The alignment is harder to control, but usually one of these techniques will work:
Theoretically, it makes no difference whether you iterate over an array forwards or backwards, but some caches are of a "predictive" type that tries to read in successive cache lines even before you need them. Because these caches must work quickly, they tend to be fairly dim and rarely have the extra logic for predicting backwards traversal of memory pages.
One common method of mapping memory addresses to cache lines is to strip off leading bits from the address.
Take as an example a hypoythetical direct-mapped 1MB cache with 128-byte cache lines and a program which uses 16MB of memory, all happening on a machine with 32 bit addresses. The simplest way for the cache to map the memory into the cache is to mask off the first 12 and the last 7 bits of the address, then shift to the right 7 bits. What we end up with is a cache that maps any two addresses exactly 8192 (2^13) bytes apart in main memory to the same cache line. If the program happens to use an array of 8192 byte structs, and refers to just one element in each one while processing the whole array, every access will map to the same cache line and force a reload, which is considerable delay.
This seems like a contrived situation, but the same problem arises for any struct which is a multiple of 8192 (in the above example). If the struct were half the size, the problem would be half as severe (and so on) but probably still noticeable. Since we have used only one (one!) cache line, any future access to the array will have to reload cache lines, in spite of the supposed 1MB capacity.
The solution in this case is probably to isolate the one field into a separate array. Say the field is 4 bytes wide; then the cache would only need a reload every 32 iterations through the array. This is probably seldom enough that the cache lines can be read in advance of their need and processing can occur at full speed.
malloc and free have some interesting behavior at times, and are often implemented in completely different ways on different machines. Some mallocs have a way of "tuning" for performance (mallopt for example.)
In some applications, malloc takes
very little time and free takes a lot, so if your program uses a bunch of memory
but doesn't really re-use any of it before it dies, the program might
run faster if you just do this #define free(x) /* no-op */
in a header someplace. I would emphasize that this has narrow
utility. Programs with long lifetimes (for example, background demons)
and programs which use a significant portion of physical memory should
carefully free everything
as soon as it's no longer needed.
Allocate less stuff in the first place; this is kind of trite since most programmers don't go to the trouble of calling malloc unless they have a good reason, but there might be a few things that could go on the stack instead. If you have some data structure which is mostly "holes" like a hash table or sparse array, it might be to your advantage to use a smaller table or list or something.
Buddy system allocators are subject to massive amounts of internal fragmentation, and often their worst case is when you allocate something whose size is near, or slightly larger than a power of two, which is not exactly uncommon. They also tend to put things of like size together in preference to putting things allocated in sequence together; this affects locality of reference, sometimes in a good way, sometimes not.
You may have mallopt on your (Unixish) system. This can be used to get malloc to allocate small blocks fairly quickly and painlessly.
SunOS (and perhaps others) have madvise() and vadvise() system calls. These can be used to clue the OS in as to what sort of way you're going to be accessing memory: randomly, sequentially, mark some pages as no longer needed, etc.
In the current computer era, the costs of main memory, cache memory, and disk storage for VM are steadily decreasing. It is entirely possible that rewriting the code won't improve the situation as much as simply throwing money at the problem and buying more memory.
Some common sense is required, since for mass-market software this isn't always practical. One side of the argument is that you can't just tell everyone to spend $500 in upgrades just to run your $10 shareware program. On the other hand, increasing swap space by 16MB will cost, averaged over time and population, about US$4 at current (1997) rates.
In the best of all worlds, you would run a cache profiler to find out exactly what parts of your program or data structures are causing cache problems. Few development environments include one. Often, the best that one can do is infer from an execution profile that a cache problem might exist in a certain function, especially if large arrays or many data structures are involved.
I/O (in Unix) usually puts your process to sleep for a time. The request for I/O may finish fairly quickly but perhaps some other process is busy using the CPU and your process has to wait. Because the length of the wait is somewhat arbitrary and depends on what exactly the other processes are doing, I/O bound programs can be tricky to optimize.
Buffered I/O is usually (but not always) faster than unbuffered. Some gain can be wrought from using a larger than normal sized buffer; try out setvbuf(). If you aren't worried about portability you can try using lower level routines like read() and write() with large buffers and compare their performance to fread()and fwrite(). Using read() or write() in a single-character-at-a-time mode is especially slow on Unix machines because of the system call overhead.
Consider using mmap() if you have it. This can save effort in several ways. The data doesn't have to go through stdio which saves a buffer copy. Depending on the sophistication of the paging hardware, the data need not even be copied into user space; the program can just access an existing copy. mmap() also lends itself to read-ahead; theoretically the entire file could be read into memory before you even need it. Lastly, the file is can be paged directly off the source disk and doesn't have to use up virtual memory.
Again, consider mmap() if you have it. If there's a trade off between I/O bound and memory bound in your program, consider a lazy-free of records: when memory gets tight, free unmodified records and write out modified records (to a temporary file if need be) and read them in later. Though if you take the disk space you'd use to write the records out and just add it to the paging space instead you'll save yourself a lot of hassle.
You can set up a file descriptor to be non-blocking (see ioctl(2) or fcntl(2) man pages) and arrange to have it send your process a signal when the I/O is done; in the meantime your process can get something else done, including perhaps sending off other I/O requests to other devices. Significant parallelism may result at the cost of program complexity.
Multithreading packages can aid in the construction of programs which utilize asynchronous I/O.
If your program spews out a lot of data to the screen, it's going to run slow on a 1200 baud line. Waiting for the screen to catch up stops your program. This doesn't add to the CPU or disk time as reported for accounting purposes, but it sure seems slow to the user. Bitmap displays are subject to a similar problem: xterms with jump scroll mode off can be quite slow at times.
A general solution is to provide a way for the user to squelch out irrelevant data. Screen handling utilities like curses can speed things up by reducing wasted screen updates.
These have some weird properties. You may find that sending short, infrequent messages is extremely slow. This is because small messages may just sit in a buffer for a while waiting for the rest of the buffer to get filled up. There's some socket options you can set to get around this for TCP; or switch to an non-streaming protocol such as UDP. But the usual limitation is the network itself and how busy it is.
For the most part there's no free lunch and sending more data simply takes more time. However, a local network with a switched hub, bandwidth should have a clear theoretical upper limit. If you aren't getting anywhere near it, you might be able to blame the IP implementation being used with your OS. Try switching to a different "IP stack" and see if that helps a bit.
On Unix machines there are typically two socket libraries available, BSD sockets, and TLI. One is usually implemented in terms of the other and may suffer an extra buffer copy or other overhead. Try them both and see which is faster.
This replacement for stdio written by Phong Vo and Dave Korn is available through netlib@research.att.com. By slightly changing the calling sequences they have enabled several nifty optimizations and claim a substantial improvement.
Some of the things that may trip up novices:
number of runs × number of users × time savings ×even if the program will be run hundreds of times by thousands of users, an extra day spent saving 40 milliseconds probably isn't going to help.
user's salary - time spent optimizing × programmer's salary
A particular problem is that many programmers are power users and have machines loaded with memory and a math co-processor and tons of disk space while the users get bare-bones CPUs and run over a network. The programmer gets a distorted view of the program's performance and may fail to optimize the parts of the program which are taking up the user's time, such as floating point or memory intensive routines.
Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman Compilers: Principles, Techniques, and Tools (The "Dragon Book") ISBN 0-201-10088-6
The Dragon Book has quite a bit of detail about the inner workings of optimizers and what types of optimizations are easiest for the compiler to perform. By reading it you may gain some insight into what you can reasonably expect a modern compiler to do for you. There is an entry in the Jargon file for it as well:"Dragon Book n. The classic text "Compilers: Principles, Techniques and Tools", by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman (Addison-Wesley 1986; ISBN 0-201-10088-6), so called because of the cover design featuring a dragon labeled `complexity of compiler design' and a knight bearing the lance `LALR parser generator' among his other trappings. This one is more specifically known as the `Red Dragon Book' (1986); an earlier edition, sans Sethi and titled "Principles Of Compiler Design" (Alfred V. Aho and Jeffrey D. Ullman; Addison-Wesley, 1977; ISBN 0-201-00022-9), was the `Green Dragon Book' (1977). (Also `New Dragon Book', `Old Dragon Book'.) The horsed knight and the Green Dragon were warily eying each other at a distance; now the knight is typing (wearing gauntlets!) at a terminal showing a video-game representation of the Red Dragon's head while the rest of the beast extends back in normal space. See also book titles."
American National Standard for Information Systems -- Programming Language -- C. ANSI X3.159-1989 aka ISO 9899-1990
Optimizers will sometimes mangle code that depends on side effects or order of evaluation so you should be aware of what's legal and what's not. By sticking with the letter of the law you should be able to use the highest level of optimization safely. The Rationale (available at several ftp sites) points out some of the optimization-related weirdnesses of C.Dowd, Kevin & Severance, Charles. High Performance Computing, 2nd Edition. O'Reilly & Associates. ISBN 1-56592-312-XI've referred to ANSI C throughout the document; substitute the term ISO C if you prefer.
This is a good introduction and reference for optimization of programs for RISC and other modern architectures. There is coverage of programming for parallelism, techniques for minimizing memory cache misses, and a lot of common sense guidelines sprinkled throughout about the proper way to approach the problem. Ordering information is available at the O'Reilly & Associates web site:
Kernighan, Brian & Ritchie, Dennis. The C Programming Language Second Edition. Prentice Hall, Englewood Cliffs NJ. ISBN 0-13-110370-9
A more widely available, more readable, and cheaper reference than the Standard. There isn't any specific advice about optimization though.
Knuth, Donald. The Art of Computer Programming Vol. 3 Searching and Sorting. Addison-Wesley, Reading MA. ISBN 0-201-03803-X
Knuth presents in excruciating detail the time and space complexity of nearly every imaginable search and sort algorithm. Since many programs are constrained either by a search or sort, or by some other algorithm which boils down to one of those, you may find this interesting reading. The examples are written in rather odd pseudo-assembly language called MIX which appears (to me) to have been designed specifically so that you can talk about exactly how many CPU cycles something takes.
Alvin R. Lebeck and David A. Wood "Cache Profiling and the
SPEC Benchmarks: A Case Study,"
IEEE Computer, vol. 27, no. 10, Oct.
1994, pp. 15-26.
Research paper describes a UNIX/X-windows cache-profiler system and their success in using it to optimize the SPEC family of benchmarks.
Mulder, Art. comp.windows.x: Getting more
performance out of X.
Many nifty GUI applications are constrained by limitations of the X client/server protocol or by the speed of the server itself. A profiling program run only on the application itself will not be able to pinpoint the true bottlenecks since they are in a different process. This describes some of the common problems and what to do about them, both from a programmer's viewpoint and for regular users.
NULLSTONE Corporation. NULLSTONE
Optimization Categories,
A laundry list of the optimizations that compilers can and should do.
Pure Atria. Performance Engineering,
One corporation's viewpoint on optimization strategies.Sedgewick, R. Algorithms in C. Addison-Wesley, Reading MA, 1990. ISBN 0-201-51425-7.
Contains most of the classic algorithms, well presented.Standish, Thomas. Data Structure Techniques. Addison-Wesley, Reading MA. ISBN 0-201-07526-4.
This has a fairly complete description of everything you could ever want to know about hash tables, stacks, linked lists, arrays, strings, trees, queues and the algorithms for maintaining each. My copy is fairly old so it may be out of print or revised or something by now.Summit, Steve et al. Comp.lang.c Answers to Frequently Asked Questions.
Very little of this has to do with optimization specifically, but as with the ANSI standard, it is valuable to know what portions of the language are safe to exercise to their limits before you go a-hacking. Also available in book form.
Unix man pages. gprof(1) tcov(1) prof(1) time(1) csh(1) csh_builtins(1) monitor(3) cc(1) lorder(1) tsort(1) malloc(3) Included with most installations of the Unix[tm] operating system.
Specifically, check the man pages to find out what compile options you need for profiling. The SEE ALSO section of man pages will occasionally point you in interesting directions.
Abrash, Michael. Zen of Program Optimization. ISBN 1-883577-03-9
Lots of details about optimizing Intel x86 code.
Bentley, Jon. Writing Efficient Programs. Prentice Hall. ISBN 0-13-970244-X.
Bentley, Jon. Programming Pearls. Addison-Wesley. 1986/1989 ISBN 0-201-10331-1
Jackson, Michael A. Principles of Program Design. Academic Press, London and New York, 1975.
There are two rules for when to optimize:
- Don't do it.
- (For experts only) Don't do it yet.
Kernighan, Brian W and Plauger, P.J. The Elements of Programming Style Second Edition. McGraw-Hill. 1978. ISBN 0-07-034207-5.
"The Elements of Programming Style has the most compelling discussions on efficiency (not to mention numerous other matters) that I've ever had the pleasure to read. Many of the programming blunders which the book dissects stem from misguided or heavyhanded optimization attempts; chapter 7 is devoted to Efficiency and Instrumentation, with some extremelyconvincing examples of (deprecated) code which uses poor algorithms and/or simpleminded source-level slower than straightforward code."Kozen, Dexter C. The Design and Analysis of Algorithms. Springer-Verlag, New York. 1992.
"The book treats a number of advanced algorithms and data structures, among the topics covered are: Splay Trees, Binomial Heaps, Fibonacci Heaps, Union-Find algorithms and Random Search Trees. It is not a book that I would recommend to someone who doesn't have a good understanding of algorithmics and mathematics, some of the topics covered are also very specialized, but it may be a worthwhile addition to some of the more basic books on algorithmics."Kruse, Robert L.; Bruce P. Leung; Clovis L. Tondo. Data Structures and Program Design in C. Prentice Hall. ISBN 0-13-725649-3
H. Massalin, "Superoptimizer: a look at the smallest program", Proc. ASPLOS II, 1987, pp. 122-127.
Spuler, David. C++ and C Efficiency. Prentice Hall, Sydney. ISBN 0-13-096595-2
Wirth, Niklaus. Algorithms + Data Structures = Programs.
I would especially like to thank these people for their help with grammar, catching typos, assistance with the structure of the document and most of all for taking the time to explain some of these optimization techniques to me.
If you have any suggestions or comments please let me know.
Copyright © 1997 Michael E. Lee
mikey -at- ontek.com