Category: Linux
2010-07-13 15:39:46
The real work actually takes place inside the glibc memory allocator. The allocator hands out blocks to the application, carving them from the heap that comes (however infrequently) from the kernel.
The allocator is the manager, while the kernel is the worker. With this in mind, it's easy to understand that maximum efficiency comes from a good allocator, not from the kernel.
glibc uses an allocator named ptmalloc. Wolfram Gloger created it as a modified version of the original malloc library created by Doug Lea. The allocator manages the allocated blocks in terms of "chunks." A chunk holds the memory block you actually requested, but its size is not the size you asked for: the allocator adds an extra header inside the chunk besides the user data.
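One way to see that a chunk is not sized exactly like the request is to ask glibc how many bytes it actually made usable. The sketch below assumes glibc's malloc_usable_size() extension; usable_bytes_for is a hypothetical helper name. The figure it returns reflects the rounding of the chunk size only; the chunk header itself is not included.

```c
#include <malloc.h>   /* malloc_usable_size(): glibc extension */
#include <stdlib.h>

/*
 * Return how many bytes the allocator actually made usable for a
 * request of `requested` bytes, or 0 if the allocation failed.
 * (usable_bytes_for is an illustrative name, not a glibc function.)
 */
size_t usable_bytes_for(size_t requested)
{
    void *p = malloc(requested);
    size_t usable;

    if (p == NULL)
        return 0;
    usable = malloc_usable_size(p);
    free(p);
    return usable;
}
```

Calling usable_bytes_for(100) typically reports a value somewhat above 100; the exact number depends on the glibc version and the platform.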
The allocator uses two functions to get a chunk of memory from the kernel:
brk() sets the end of the process's data segment.
mmap() creates a new VMA and passes it to the allocator.
Of course, malloc() uses these functions only if there are no more free chunks in the current pool.
The decision on whether to use brk() or mmap() requires one simple check. If the request is equal to or larger than M_MMAP_THRESHOLD, the allocator uses mmap(). If it is smaller, the allocator calls brk(). By default, M_MMAP_THRESHOLD is 128KB, but you may freely change it by using mallopt().
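Changing the threshold takes a single mallopt() call. The following is a minimal sketch assuming glibc's <malloc.h>; set_mmap_threshold is a hypothetical wrapper name, and the value you choose should come from measuring your own application.

```c
#include <malloc.h>   /* mallopt(), M_MMAP_THRESHOLD (glibc) */
#include <stddef.h>

/*
 * Ask the allocator to serve requests of at least `bytes` bytes with
 * mmap(). Returns 1 on success and 0 on error, as mallopt() does.
 * (set_mmap_threshold is an illustrative name.)
 */
int set_mmap_threshold(size_t bytes)
{
    return mallopt(M_MMAP_THRESHOLD, (int)bytes);
}
```

For example, set_mmap_threshold(256 * 1024) would send only requests of 256KB or more to mmap(). Call it early, before the allocations you want it to affect.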
In the OOM context, how ptmalloc frees memory blocks is interesting. Blocks allocated via mmap() get freed via a munmap() call, and thus become completely released. Freeing blocks allocated via brk() means marking them as free, but they remain under the allocator's control. It can reassign free chunks to satisfy another malloc() if the request's size is less than or equal to the chunk's size. The allocator can consolidate multiple free chunks, as long as they are adjacent. It may even split a free chunk into smaller chunks to satisfy smaller future requests.
This implies that a free chunk may sit unused if the allocator cannot fit future requests within it. Failure to coalesce free chunks may also bring on OOM sooner. This is usually an indication of moderate to bad memory fragmentation.
Once an OOM situation occurs, now what? The kernel will terminate a process. Why kill? This is the only way to stop further memory requests. The kernel cannot assume there is a sophisticated mechanism inside the process to stop further requests automatically, so it has no other choice but to kill it.
How does the kernel know exactly which process to kill? The answer lies inside mm/oom_kill.c of the Linux source code. This C code represents the so-called OOM killer of the Linux kernel. The function badness() gives a score to each existing process. The one with the highest score will be the victim. The criteria are:
The process with the biggest score "wins" the election and the OOM killer will kill it very soon.
The heuristic isn't perfect, but usually it works well for most situations. Criteria #1 and #2 clearly show that it is the VMA size that matters, not the number of actual pages a process has. You might think that measuring VMA size will trigger a false alarm, but luckily it doesn't. The badness() call occurs inside the page allocation functions when there are few free pages left and page frame reclamation fails, so the VMA size closely matches the number of pages owned by the process.
Why not just count the actual number of pages? That would take more time and require locks, making the procedure too expensive for a fast decision. Knowing that the OOM killer isn't perfect, you must be ready for a wrong kill.
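The kernel signals the target process to make it stop. In kernels of this generation (2.6.11, for example), it sends SIGTERM to processes with direct hardware access and SIGKILL, which cannot be caught, to all others.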
The basic rule for avoiding OOM risk is simple: don't allocate beyond the machine's current free space. However, many factors come into play, so there are further refinements to the strategy.
There is no need to use any sophisticated allocator. You can reduce fragmentation by properly ordering memory allocation and deallocation. Because holes form easily, use a LIFO strategy: the last block you allocate is the first one you free.
For example, instead of doing:
void *a;
void *b;
void *c;
/* ... */
a = malloc(1024);
b = malloc(5678);
c = malloc(4096);
/* ... */
free(b);              /* likely leaves a 5678-byte hole between a and c */
b = malloc(12345);    /* too big for that hole, so the heap must grow */
It's better to do:
a = malloc(1024);
c = malloc(4096);
b = malloc(5678);
/* ... */
free(b);              /* b was allocated last, so no hole forms */
b = malloc(12345);
This way, there won't be any hole between the a and c chunks. You can also consider realloc() to resize any existing malloc()-ed blocks.
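The realloc() suggestion might look like the sketch below. grow_block is a hypothetical helper name. The point it illustrates is that the result of realloc() goes into a temporary first, so a failed resize does not lose the original block.

```c
#include <stdlib.h>

/*
 * Resize *block to new_size bytes (new_size > 0) with realloc().
 * Returns 0 on success. On failure returns -1 and leaves *block
 * valid and unchanged, so the caller does not leak it.
 * (grow_block is an illustrative name.)
 */
int grow_block(void **block, size_t new_size)
{
    void *tmp = realloc(*block, new_size);

    if (tmp == NULL)
        return -1;
    *block = tmp;
    return 0;
}
```

In the second ordering above, the free(b) and malloc(12345) pair could then become grow_block(&b, 12345). Because b is the most recently allocated block, the allocator has a good chance of extending it in place, though that is not guaranteed.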
Two example programs (fragmented1.c and fragmented2.c) demonstrate the effect of allocation rearrangement. Reports at the end of both programs give the number of bytes allocated by the system (kernel and glibc allocator) and the number of bytes actually used. For example, on kernel 2.6.11.1 with glibc 2.3.3-27, run without an explicit parameter, fragmented1 wasted 319858832 bytes (about 305MB) while fragmented2 wasted 2089200 bytes (about 2MB). That's roughly 153 times smaller!
You can do further experiments by passing various numbers as the program parameter. This parameter acts as the request size of the malloc() call.
You can change the behavior of the Linux kernel through the /proc filesystem, as documented in Documentation/vm/overcommit-accounting in the Linux kernel's source code. You have three choices when tuning kernel overcommit, expressed as numbers in /proc/sys/vm/overcommit_memory:
0 means that the kernel will use predefined heuristics when deciding whether to allow such an overcommit. This is the default.
1 always overcommits. Perhaps you now realize the danger of this mode.
2 prevents overcommit from exceeding a certain watermark. The watermark is also tunable through /proc/sys/vm/overcommit_ratio. Within this mode, the total commit cannot exceed the swap space size + overcommit_ratio percent * RAM size. By default, the overcommit ratio is 50.
The default mode usually works fine in most situations, but mode #2 offers better protection against overcommit. On the other hand, mode #2 requires you to predict carefully how much space all running applications need. You certainly don't want to see your application unable to get more memory chunks just because the limit is too strict. However, mode #2 is the best way to avoid having a program killed suddenly.
Suppose that you have 256MB of RAM and 256MB of swap and you want to limit overcommit to 384MB. That means 256MB + 50 percent * 256MB, so write 50 to /proc/sys/vm/overcommit_ratio.
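The arithmetic for mode #2 is easy to check in code. This sketch follows only the formula given in the text (swap + ratio percent * RAM); commit_limit_kb is a hypothetical name, and a real kernel may apply further adjustments that are not modeled here.

```c
/*
 * Mode #2 commit limit in kB, following the formula in the text:
 *   swap + overcommit_ratio percent of RAM
 * (commit_limit_kb is an illustrative name.)
 */
unsigned long long commit_limit_kb(unsigned long long ram_kb,
                                   unsigned long long swap_kb,
                                   unsigned int ratio_percent)
{
    return swap_kb + ram_kb * ratio_percent / 100;
}
```

With 256MB of RAM (262144 kB), 256MB of swap, and a ratio of 50, the function returns 393216 kB, which is the 384MB from the example.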
Check for NULL Pointer after Memory Allocation and Audit for Memory Leak
This is a simple rule, but it is sometimes neglected. By checking for NULL, at least you know whether the allocator could extend the memory area, although there is no guarantee that it will be able to allocate the needed pages later. Usually, you need to bail out or delay the allocation for a moment, depending on your scenario. Together with the overcommit tunables, you have a decent tool to anticipate OOM, because malloc() will return NULL if it believes that it cannot acquire free pages later.
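A minimal sketch of the "bail out or delay" idea follows. alloc_with_retry is a hypothetical helper, and the one-second pause and the retry count are arbitrary assumptions; a real program would pick a policy that suits its workload.

```c
#include <stdlib.h>
#include <unistd.h>   /* sleep() */

/*
 * Try to allocate `size` bytes. If malloc() returns NULL, pause for
 * one second and try again, up to `retries` extra attempts.
 * Returns NULL if every attempt fails, so the caller can bail out.
 * (alloc_with_retry is an illustrative name.)
 */
void *alloc_with_retry(size_t size, unsigned int retries)
{
    for (;;) {
        void *p = malloc(size);

        if (p != NULL || retries == 0)
            return p;
        retries--;
        sleep(1);
    }
}
```

The caller still has to test the result: a NULL return after the last retry is the signal to release what it can or to exit cleanly.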
A memory leak is also a source of unnecessary memory consumption. A leaked memory block is one that the application no longer tracks, but that the kernel will not reclaim because, from the kernel's point of view, the task still has it under control. Valgrind is a nice tool for finding such occurrences in your code without the need to re-code.
The Linux kernel provides /proc/meminfo as a way to find complete information about memory conditions. This /proc entry is also an information source for utilities such as top, free, and vmstat.
What you need to check is the free and reclaimable memory. The word "free" needs no further explanation, but what does "reclaimable" mean? It refers to buffers and page caches--the disk cache. They are reclaimable because, when memory is tight, the Linux kernel can simply flush them back out to disk. These are file-backed pages. I've lightly edited this example of memory statistics:
$ cat /proc/meminfo
MemTotal: 255944 kB
MemFree: 3668 kB
Buffers: 13640 kB
Cached: 171788 kB
SwapCached: 0 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 255944 kB
LowFree: 3668 kB
SwapTotal: 909676 kB
SwapFree: 909676 kB
Based on the output above, the free virtual memory is MemFree + Buffers + Cached + SwapFree = 1098772 kB.
I failed to find any formalized C (glibc) function to find out free (including reclaimable) memory space. The closest I found is get_avphys_pages() or sysconf() (with the _SC_AVPHYS_PAGES parameter). They only report the amount of free memory, not the free + reclaimable amount.
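For reference, the sysconf() route looks roughly like this. free_physical_bytes is a hypothetical name, and _SC_AVPHYS_PAGES is a glibc extension rather than a POSIX requirement, so this sketch assumes Linux with glibc.

```c
#include <unistd.h>

/*
 * Free physical memory in bytes as reported by sysconf(), or -1 if
 * either value is unavailable. Reclaimable memory is NOT included.
 * (free_physical_bytes is an illustrative name.)
 */
long long free_physical_bytes(void)
{
    long pages = sysconf(_SC_AVPHYS_PAGES);   /* glibc extension */
    long page_size = sysconf(_SC_PAGESIZE);

    if (pages < 0 || page_size < 0)
        return -1;
    return (long long)pages * page_size;
}
```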
That means that to get precise information, you must programmatically parse /proc/meminfo and calculate it yourself. If you're lazy, take the procps source package as a reference on how to do it. This package contains tools such as ps, top, and free. It is available under the GPL.
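A small parser for the four fields used in the calculation above could look like the following. This is a sketch, not the procps implementation; free_plus_reclaimable_kb is a hypothetical name, and the function takes an already opened stream so that it can be fed either /proc/meminfo or saved output.

```c
#include <stdio.h>
#include <string.h>

/*
 * Sum MemFree + Buffers + Cached + SwapFree (in kB) from a stream in
 * /proc/meminfo format. Returns -1 if any of the four fields is missing.
 * (free_plus_reclaimable_kb is an illustrative name.)
 */
long free_plus_reclaimable_kb(FILE *fp)
{
    static const char *const wanted[] = {
        "MemFree:", "Buffers:", "Cached:", "SwapFree:"
    };
    char line[256];
    char key[64];
    long value;
    long total = 0;
    unsigned int seen = 0;   /* one bit per field found */

    while (fgets(line, sizeof line, fp) != NULL) {
        /* Each line looks like "Key:   12345 kB". */
        if (sscanf(line, "%63s %ld", key, &value) != 2)
            continue;
        for (unsigned int i = 0; i < 4; i++) {
            /* Exact match, so "SwapCached:" is not counted as "Cached:". */
            if (strcmp(key, wanted[i]) == 0) {
                total += value;
                seen |= 1u << i;
            }
        }
    }
    return seen == 0xFu ? total : -1;
}
```

Typical use is to fopen("/proc/meminfo", "r"), pass the stream in, and fclose() it afterwards. Fed the sample output shown earlier, the function returns 1098772.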
Different allocators yield different ways to manage memory chunks and to shrink, expand, and create virtual memory areas. Hoard is one example. Emery Berger from the University of Massachusetts wrote it as a high performance memory allocator. Hoard seems to work best for multi-threaded applications; it introduces the concept of a per-CPU heap.
Users who need larger user address spaces can consider using 64-bit platforms. The Linux kernel no longer uses the 3:1 VM split for these machines. In other words, user space becomes quite large. It can be a good match for machines with more than 4GB of RAM.
This has no connection to extended addressing schemes, such as Intel's Physical Address Extension (PAE), which allows a 32-bit Intel processor to address up to 64GB of RAM. PAE deals with physical addresses, while in the virtual address context, the user space itself is still 3GB (assuming the 3:1 VM split). This extra memory is reachable, but not all mappable into the address space. Unmappable portions of RAM are unusable.
Packed attributes can help to squeeze the size of structs, enums, and unions. This is a way to save more bytes, especially for arrays of structs. Here is a declaration example:
struct test
{
    char a;
    long b;
} __attribute__ ((packed));
The downside is that packing leaves certain fields unaligned, so accessing them costs more CPU cycles. "Aligned" here means the variable's address is a multiple of its data type's natural size. The net result is that, depending on the data access frequency, the runtime may get relatively slower. Take page alignment and cache coherence into account as well.
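To see how much packing saves on a given platform, compare sizeof for the two layouts. The sketch assumes a compiler that supports the GCC-style packed attribute; test_padded, test_packed, and bytes_saved_per_element are hypothetical names, and the actual saving depends on the ABI's alignment of long.

```c
#include <stddef.h>

/* Same fields as the declaration above, without the packed attribute. */
struct test_padded {
    char a;
    long b;
};

/* Packed variant: no padding between a and b. */
struct test_packed {
    char a;
    long b;
} __attribute__ ((packed));

/* Bytes saved per array element by packing on this platform. */
size_t bytes_saved_per_element(void)
{
    return sizeof(struct test_padded) - sizeof(struct test_packed);
}
```

Multiplying the result by the number of elements in an array gives the total saving, which you can then weigh against the slower unaligned accesses.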
ulimit() for User Processes
With ulimit -v, you can limit the address space a process can allocate with mmap(). When you reach the limit, all mmap() calls, and hence malloc() calls, will fail (malloc() returns NULL), and the kernel's OOM killer will never start. This is most useful in a multi-user environment where you cannot trust all of the users and want to avoid killing random processes.
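The same limit can be set from inside a program. The sketch below assumes that setrlimit() with RLIMIT_AS is the programmatic counterpart of ulimit -v; malloc_fails_under_limit is a hypothetical helper that lowers the soft limit, attempts one allocation, and then restores the previous soft limit. The limit is process-wide, so treat this as a demonstration rather than something to run in a busy multi-threaded program.

```c
#include <stdlib.h>
#include <sys/resource.h>

/*
 * Lower the soft address-space limit to `limit` bytes, try to allocate
 * `request` bytes, then restore the previous soft limit.
 * Returns 1 if malloc() failed under the limit, 0 if it succeeded,
 * and -1 if the limit could not be read or changed.
 * (malloc_fails_under_limit is an illustrative name.)
 */
int malloc_fails_under_limit(rlim_t limit, size_t request)
{
    struct rlimit old, tmp;
    void *volatile p;   /* volatile keeps the compiler from eliding malloc() */
    int failed;

    if (getrlimit(RLIMIT_AS, &old) != 0)
        return -1;
    tmp = old;
    tmp.rlim_cur = limit;
    if (setrlimit(RLIMIT_AS, &tmp) != 0)
        return -1;

    p = malloc(request);
    failed = (p == NULL);
    free(p);            /* free(NULL) is a no-op */

    if (setrlimit(RLIMIT_AS, &old) != 0)
        return -1;
    return failed;
}
```

For instance, asking for 512MB under a 256MB limit should make malloc() return NULL instead of leaving the process to the OOM killer.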
The author gives credit to several people for their assistance and help: Peter Ziljtra, Wolfram Gloger, and Rene Hermant. Mr. Gloger also contributed the ulimit() technique.