Category: LINUX
2010-07-15 23:08:22
At the bottom of this problem lie other questions: how much memory do you want to allocate? How much does the operating system (OS) allocate for you? The basic reason for OOM is simple: you've asked for more than the available virtual memory. I say "virtual" because RAM isn't the only place counted as free memory; any swap areas count too.
To begin exploring OOM, first type and run this code snippet that allocates huge blocks of memory:
#include <stdio.h>
#include <stdlib.h>

#define MEGABYTE 1024*1024

int main(int argc, char *argv[])
{
    void *myblock = NULL;
    int count = 0;

    while (1)
    {
        myblock = (void *) malloc(MEGABYTE);
        if (!myblock) break;
        printf("Currently allocating %d MB\n", ++count);
    }

    exit(0);
}
Compile the program, run it, and wait for a moment. Sooner or later it will go OOM. Now compile the next program, which allocates huge blocks and fills them with 1:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MEGABYTE 1024*1024

int main(int argc, char *argv[])
{
    void *myblock = NULL;
    int count = 0;

    while (1)
    {
        myblock = (void *) malloc(MEGABYTE);
        if (!myblock) break;
        memset(myblock, 1, MEGABYTE);
        printf("Currently allocating %d MB\n", ++count);
    }

    exit(0);
}
Notice the difference? Most likely, program A allocates more memory blocks than program B does. It's also obvious that you will see the word "Killed" not long after executing program B. Both programs end for the same reason: there is no more space available. More specifically, program A ends gracefully because of a failed malloc(), while program B is ended by the Linux kernel's so-called OOM killer.
The first fact to observe is the number of allocated blocks. Assume that you have 256MB of RAM and 888MB of swap (my current Linux settings). Program B ended at:
Currently allocating 1081 MB
On the other hand, program A ended at:
Currently allocating 3056 MB
Where did A get that extra 1975MB? Did I cheat? Of course not! If you look more closely at both listings, you will find that program B fills the allocated memory space with 1s, while A merely allocates without doing anything with the blocks. This happens because Linux employs deferred page allocation. In other words, allocation doesn't actually happen until the moment you really use the memory; for example, by writing data to the block. So, unless you touch the block, you can keep asking for more. The technical term for this is optimistic memory allocation.
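You can watch deferred allocation happen from inside a process by reading its own /proc/self/status. The following sketch (the helper name show_mem and the 100MB size are my own choices, not part of the original programs) allocates a block, prints VmSize and VmRSS, touches the block with memset(), and prints them again; VmSize jumps immediately, while VmRSS only grows once the pages are actually touched:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MEGABYTE 1024*1024

/* Print the VmSize and VmRSS lines of this process's own status file. */
static void show_mem(const char *tag)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");

    if (!f) return;
    printf("--- %s ---\n", tag);
    while (fgets(line, sizeof(line), f))
        if (!strncmp(line, "VmSize", 6) || !strncmp(line, "VmRSS", 5))
            fputs(line, stdout);
    fclose(f);
}

int main(void)
{
    void *myblock = malloc(100 * MEGABYTE);
    if (!myblock) return 1;

    show_mem("after malloc (VmSize grows, VmRSS barely moves)");
    memset(myblock, 1, 100 * MEGABYTE);   /* actually touch the pages */
    show_mem("after memset (now VmRSS catches up)");

    free(myblock);
    return 0;
}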
Checking /proc/<pid>/status tells the rest of the story. Here's program A, just before its last failed malloc():
$ cat /proc/<pid>/status
VmPeak: 3141876 kB
VmSize: 3141876 kB
VmLck: 0 kB
VmHWM: 12556 kB
VmRSS: 12556 kB
VmData: 3140564 kB
VmStk: 88 kB
VmExe: 4 kB
VmLib: 1204 kB
VmPTE: 3072 kB
Here's program B, shortly before the OOM killer struck:
$ cat /proc/<pid>/status
VmPeak: 1072512 kB
VmSize: 1072512 kB
VmLck: 0 kB
VmHWM: 234636 kB
VmRSS: 204692 kB
VmData: 1071200 kB
VmStk: 88 kB
VmExe: 4 kB
VmLib: 1204 kB
VmPTE: 1064 kB
VmRSS deserves further explanation. RSS stands for "Resident Set Size": it tells you how much of the memory allocated by the task currently resides in RAM. Also note that before B reaches OOM, swap usage is almost 100 percent (most of the 888MB), while A uses no swap at all. It's clear that malloc() itself did nothing more than reserve a memory area.
Another question arises: even without touching the pages, why is the allocation limit 3056MB? This exposes an unseen limit. For every application on a 32-bit system, there is 4GB of address space available. The Linux kernel usually splits this linear address space to provide 0 to 3GB for user space and 3GB to 4GB for kernel space. User space is a room where a task can do anything it wants, while kernel space is solely for the kernel. If you try to cross this 3GB border, you will get a segmentation fault.
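A quick way to feel this limit is to request a single block close to 3GB. On a 32-bit build (for example, compiled with gcc -m32), such a request fails immediately no matter how much RAM and swap the machine has; on a 64-bit build it may well succeed, thanks to the much larger address space and overcommit. A minimal sketch:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t three_gb = 3UL * 1024 * 1024 * 1024;   /* roughly the whole user space */
    void *p = malloc(three_gb);

    if (!p)
        printf("malloc(3GB) failed: no room left in the address space\n");
    else
        printf("malloc(3GB) succeeded (probably a 64-bit build)\n");

    free(p);
    return 0;
}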
The conclusion is that OOM happens for two technical reasons: either the process runs out of usable address space, as program A did, or the system runs out of physical pages, that is, RAM plus swap, as program B did. The strategies to prevent those circumstances follow directly: keep your virtual address space needs within the user-space limit, and make sure enough RAM and swap are available for the pages your workload actually touches.
When you ask for a memory block, usually by using malloc(), you're asking the runtime C library whether a preallocated block is available. This block's size must at least equal the user request. If such a block is already available, malloc() assigns it to the user and marks it as "used." Otherwise, malloc() must allocate more memory by extending the heap. All requested blocks go in an area called the heap. Do not confuse it with the stack, which stores local variables and function return addresses. These two sections have different jobs.
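You can watch the heap being extended by checking the program break with sbrk(0) around a small malloc() call. This is only a sketch: glibc serves large requests (roughly 128KB and above by default) with mmap() instead of extending the heap, and the exact numbers you see depend on the allocator's bookkeeping.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *before = sbrk(0);        /* current end of the heap */
    void *block  = malloc(4096);   /* small request, served from the heap */
    void *after  = sbrk(0);        /* the break has usually moved up */

    printf("break before malloc: %p\n", before);
    printf("break after  malloc: %p\n", after);
    printf("allocated block at : %p\n", block);

    free(block);
    return 0;
}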
Where is the heap located in the address space? The process address map can tell you exactly where:
$ cat /proc/self/maps
0039d000-003b2000 r-xp 00000000 16:41 1080084 /lib/ld-2.3.3.so
003b2000-003b3000 r-xp 00014000 16:41 1080084 /lib/ld-2.3.3.so
003b3000-003b4000 rwxp 00015000 16:41 1080084 /lib/ld-2.3.3.so
003b6000-004cb000 r-xp 00000000 16:41 1080085 /lib/tls/libc-2.3.3.so
004cb000-004cd000 r-xp 00115000 16:41 1080085 /lib/tls/libc-2.3.3.so
004cd000-004cf000 rwxp 00117000 16:41 1080085 /lib/tls/libc-2.3.3.so
004cf000-004d1000 rwxp 004cf000 00:00 0
08048000-0804c000 r-xp 00000000 16:41 130592 /bin/cat
0804c000-0804d000 rwxp 00003000 16:41 130592 /bin/cat
0804d000-0806e000 rwxp 0804d000 00:00 0 [heap]
b7d95000-b7f95000 r-xp 00000000 16:41 2239455 /usr/lib/locale/locale-archive
b7f95000-b7f96000 rwxp b7f95000 00:00 0
b7fa9000-b7faa000 r-xp b7fa9000 00:00 0 [vdso]
bfe96000-bfeab000 rw-p bfe96000 00:00 0 [stack]
This is an actual address space layout, shown for cat, but you may get different results; it is up to the Linux kernel and the runtime C library to arrange the mappings. Notice that recent Linux kernel versions (2.6.x) kindly label the memory areas, but don't rely on those labels completely.
The heap is basically the free address space not already taken by the program's mappings and the stack; thus, it narrows down the available address space. It's not a full 3GB, but 3GB minus everything else that's mapped. The bigger your program's code segment is, the less space you have for the heap. The more dynamic libraries you link into your program, the less space you get for the heap. This is important to remember.
How does the map for program A look when it can't allocate more memory blocks? With a trivial change to pause the program (see loop.c and loop-calloc.c; a sketch of this change follows the VMA discussion below) just before it exits, the final map is:
0009a000-0039d000 rwxp 0009a000 00:00 0 ---------> (allocated block)
0039d000-003b2000 r-xp 00000000 16:41 1080084 /lib/ld-2.3.3.so
003b2000-003b3000 r-xp 00014000 16:41 1080084 /lib/ld-2.3.3.so
003b3000-003b4000 rwxp 00015000 16:41 1080084 /lib/ld-2.3.3.so
003b6000-004cb000 r-xp 00000000 16:41 1080085 /lib/tls/libc-2.3.3.so
004cb000-004cd000 r-xp 00115000 16:41 1080085 /lib/tls/libc-2.3.3.so
004cd000-004cf000 rwxp 00117000 16:41 1080085 /lib/tls/libc-2.3.3.so
004cf000-004d1000 rwxp 004cf000 00:00 0
005ce000-08048000 rwxp 005ce000 00:00 0 ---------> (allocated block)
08048000-08049000 r-xp 00000000 16:06 1267 /test-program/loop
08049000-0804a000 rwxp 00000000 16:06 1267 /test-program/loop
0806d000-b7f62000 rwxp 0806d000 00:00 0 ---------> (allocated block)
b7f73000-b7f75000 rwxp b7f73000 00:00 0 ---------> (allocated block)
b7f75000-b7f76000 r-xp b7f75000 00:00 0 [vdso]
b7f76000-bf7ee000 rwxp b7f76000 00:00 0 ---------> (allocated block)
bf80d000-bf822000 rw-p bf80d000 00:00 0 [stack]
bf822000-bff29000 rwxp bf822000 00:00 0 ---------> (allocated block)
Six Virtual Memory Areas, or VMAs, reflect the memory request. A VMA is a memory area that groups pages with the same access permission and/or the same backing file. VMAs can exist anywhere within user space, as long as that space is available.
Now you might think, "Why six? Why not a single big VMA containing
all blocks?" There are two reasons. First, it is often impossible to
find such a big "hole" to coalesce the blocks into a single VMA. Second,
the program does not ask for that roughly 3GB of memory all at once, but piece by piece. Thus, the glibc allocator has complete freedom to arrange the memory however it wants.
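The "trivial change to pause the program" mentioned above can be as simple as waiting for a signal before exiting, so the process stays alive and its /proc/<pid>/maps can be read from another shell. A sketch of how loop.c might look (the actual file may differ):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MEGABYTE 1024*1024

int main(int argc, char *argv[])
{
    void *myblock = NULL;
    int count = 0;

    while (1)
    {
        myblock = (void *) malloc(MEGABYTE);
        if (!myblock) break;
        printf("Currently allocating %d MB\n", ++count);
    }

    /* Instead of exiting, wait here so that /proc/<pid>/maps can be
       inspected from another terminal. Kill the process when done. */
    pause();

    exit(0);
}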
Why do I mention available pages? Memory allocation occurs with page-sized granularity. This is not a limitation of the operating system, but a feature of the Memory Management Unit (MMU) itself. Pages come in various sizes, but the normal setting on x86 is 4K. You can discover the page size manually by using the getpagesize() or sysconf() (with the _SC_PAGESIZE parameter) libc functions. The libc allocator manages each page: slicing pages into smaller blocks, assigning them to processes, freeing them, and so on. For example, if your program uses 4097 bytes in total, it needs two pages, even though in reality the allocator gives you somewhere between 4105 and 4109 bytes.
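Both calls are standard; a minimal check of the page size on your own machine looks like this:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Both report the MMU page size; typically 4096 on x86. */
    printf("getpagesize():         %d\n", getpagesize());
    printf("sysconf(_SC_PAGESIZE): %ld\n", sysconf(_SC_PAGESIZE));
    return 0;
}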
With 256MB of RAM and no swap, you have 65536 available pages. Is
that right? Not really. What you don't see is that some memory areas are
in use by kernel code and data, so they're unavailable for any other
need. There is also a reserved part of memory for emergencies or
high-priority needs. dmesg
reveals these numbers for you:
$ dmesg | grep -n kernel
36:Memory: 255716k/262080k available (2083k kernel code, 5772k reserved,
637k data, 172k init, 0k highmem)
171:Freeing unused kernel memory: 172k freed
init refers to kernel code and data that is only necessary for the initialization stage; thus the kernel frees it when it is no longer useful. That leaves 2083 + 5772 + 637 = 8492KB. Practically speaking, 2123 pages are gone from the user's point of view. If you enable more kernel features or insert more kernel modules, you'll use up more pages for exclusive kernel use, so be wise.
Another kernel internal data structure is the page cache. The page cache buffers data recently read from block devices. The more caching work you do, the fewer free pages you actually have--but they are not really occupied, as the kernel will reclaim them when memory is tight.
From the kernel and hardware points of view, these are the important things to remember:
There is no guarantee that an allocated memory area is physically contiguous; it is only virtually contiguous.
This "illusion" comes from the way address translation works. In a protected mode environment, users always work with virtual addresses, while hardware works with physical addresses. The page directory and page tables translate between these two. For example, two blocks with starting virtual addresses 0 and 4096 could map to the physical addresses 1024 and 8192.
This makes allocation easier, because in reality it is unlikely to always get continuous blocks, especially for large requests (megabytes or even gigabytes). The kernel will look everywhere for free pages to satisfy the request, not just adjacent free blocks. However, it will do a little more work to arrange page tables so that they appear virtually contiguous.
There is a price. Because memory blocks might be non-contiguous, sometimes the L1 and L2 caches are underused: virtually adjacent memory blocks may be spread across different physical cache lines, which can slow down (sequential) memory access.
Memory allocation takes two steps: first extending the length of the memory area, and then allocating pages when needed. This is demand paging. During VMA extension, the kernel merely checks whether the request overlaps an existing VMA and whether the range is still inside user space. By default, it omits the check of whether the actual allocation can be satisfied.
Thus it is not strange if your program asks for a 1GB block and gets it, even if in reality you have only 16MB of RAM and 64MB of swap. This "optimistic" style might not please everybody, because you might get the false hope that there are still free pages available. The Linux kernel offers tunable parameters to control this overcommit behavior.
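The relevant knobs are the standard sysctls /proc/sys/vm/overcommit_memory (0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting) and /proc/sys/vm/overcommit_ratio. You can inspect them with cat or, as a sketch, from a small program:

#include <stdio.h>

/* Print the current overcommit policy and ratio. */
static void show(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");

    if (f && fgets(buf, sizeof(buf), f))
        printf("%s = %s", path, buf);
    if (f)
        fclose(f);
}

int main(void)
{
    show("/proc/sys/vm/overcommit_memory");
    show("/proc/sys/vm/overcommit_ratio");
    return 0;
}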
There are two types of pages: anonymous pages and file-backed pages. A file-backed page originates from mmap()-ing a file on disk, whereas an anonymous page is the kind you get from malloc(); it has no relationship with any file at all. When RAM becomes tight, the kernel swaps out anonymous pages to swap space and flushes file-backed pages back to their files to make room for current requests. In other words, anonymous pages may consume swap space while file-backed pages don't. The only exception is for files mmap()-ed using the MAP_PRIVATE flag; in that case, file modifications occur in RAM only.
This is where the understanding of swap as an extension of RAM comes from. Clearly, accessing a swapped-out page requires bringing it back into RAM first.
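To make the distinction concrete, here is a minimal sketch of both kinds of mapping. The file path is only an example; any readable file works. Writing to the MAP_PRIVATE file mapping triggers copy-on-write, so the modified page lives in RAM (and possibly swap) but the file on disk is never changed, while a dirty anonymous page can only ever be saved to swap:

#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4096;   /* one page */

    /* Anonymous page: no backing file; under memory pressure a dirty
       copy of it can only go to swap. */
    char *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* File-backed page: backed by a file on disk (example path). With
       MAP_PRIVATE, a write creates a private copy in RAM; the file
       itself is never updated. */
    int fd = open("/etc/passwd", O_RDONLY);
    char *filemap = (fd >= 0)
        ? mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0)
        : MAP_FAILED;

    if (anon != MAP_FAILED)
        strcpy(anon, "anonymous page, swappable");
    if (filemap != MAP_FAILED)
        filemap[0] = '#';   /* modifies the private copy only */

    if (filemap != MAP_FAILED) munmap(filemap, len);
    if (fd >= 0) close(fd);
    if (anon != MAP_FAILED) munmap(anon, len);
    return 0;
}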