Analysis of the Linux Kernel OOM Mechanism - yvtianzll - ChinaUnix Blog

Category: LINUX

2014-11-04 23:41:43

Original post: Analysis of the Linux Kernel OOM Mechanism, by frankzfz

I recently ran into the following problem at work:
active_anon:16777 inactive_anon:13946 isolated_anon:0
active_file:14 inactive_file:37 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
free:2081 slab_reclaimable:299 slab_unreclaimable:26435
mapped:53 shmem:171 pagetables:289 bounce:0
Normal free:8324kB min:2036kB low:2544kB high:3052kB active_anon:67108kB inactive_anon:55784kB active_file:56kB inactive_file:148kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:260096kB mlocked:0kB dirty:0kB writeback:0kB mapped:212kB shmem:684kB slab_reclaimable:1196kB slab_unreclaimable:105740kB kernel_stack:648kB pagetables:1156kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0
Normal: 655*4kB 663*8kB 17*16kB 4*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8324kB
222 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
65536 pages of RAM
2982 free pages
3251 reserved pages
26734 slab pages
709 pages shared
0 pages swap cached
Out of memory: kill process 6184 (XXX) score 166 or a child
Killed process 6184 (XXX)

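From the output above we can see that memory ran out and triggered the Linux kernel's OOM mechanism.
Linux has a feature called the OOM (Out of Memory) killer. As the name suggests, it deals with memory exhaustion and steps in when memory runs out. In Linux 2.6 kernels, when this feature is enabled and memory is exhausted, the kernel computes a score for each process, picks a suitable user-space process and kills it, in order to free memory and keep the system as a whole running. The system log then usually contains a line like: Out of memory: kill process 959 (sshd) score 55 or a child.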
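
As a small sanity check on the dump above, the per-order free list on the "Normal:" line can be summed and compared with the reported free memory. The helper below is a minimal sketch; the name buddy_free_kb and its parameters are made up for illustration and are not kernel APIs.

```c
#include <stddef.h>

/*
 * Sum a buddy free-list line such as "655*4kB 663*8kB 17*16kB ...".
 * counts[i] is the number of free blocks of order i, and each block of
 * order i is (base_kb << i) kB large. Hypothetical helper, not kernel code.
 */
unsigned long buddy_free_kb(const unsigned long *counts, size_t orders,
                            unsigned long base_kb)
{
    unsigned long total = 0;

    for (size_t i = 0; i < orders; i++)
        total += counts[i] * (base_kb << i);
    return total;
}
```

Called with the counts {655, 663, 17, 4, 0, 0, 0, 0, 0, 0, 0} and a base of 4 kB, it returns 8324, which matches the free:8324kB figure in the dump. Nearly all of that free memory sits in 4 kB and 8 kB blocks.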

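1. When does OOM occur?

When we request memory in user space we usually call malloc. Does malloc simply return NULL once there is no memory left to hand out? The answer is no. The malloc documentation describes its allocation behavior as follows: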

By default, Linux follows an optimistic memory allocation strategy. This means that when malloc() returns non-NULL there is no guarantee that the memory really is available. This is a really bad bug. In case it turns out that the system is out of memory, one or more processes will be killed by the infamous OOM killer. In case Linux is employed under circumstances where it would be less desirable to suddenly lose some randomly picked processes, and moreover the kernel version is sufficiently recent, one can switch off this overcommitting behavior using a command like:

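The passage above says that on Linux a non-NULL return from malloc does not guarantee that the memory is actually available. Linux lets programs request more memory than the system can currently provide; this is called overcommit. The rationale is presumably optimization: not every program uses the memory it requests right away, and by the time it does, the system may have reclaimed some memory. But if no memory is available when the program actually touches the allocation, the OOM killer steps in and makes some processes exit.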
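
The difference between requesting memory and actually using it can be sketched in user space. The function below is only an illustration under stated assumptions: alloc_and_touch is a made-up name, and the 4096-byte fallback page size is an assumption used only if sysconf fails.

```c
#include <stdlib.h>
#include <unistd.h>

/*
 * Allocate `bytes` and write one byte into every page of the buffer, so
 * the kernel has to back the range with real pages. malloc() on its own
 * may only reserve address space. Returns the number of pages touched,
 * or 0 if malloc() itself failed. Hypothetical helper for illustration.
 */
size_t alloc_and_touch(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    char *buf;
    size_t touched = 0;

    if (page <= 0)
        page = 4096;                    /* assumed fallback page size */

    buf = malloc(bytes);
    if (buf == NULL)
        return 0;

    for (size_t off = 0; off < bytes; off += (size_t)page) {
        ((volatile char *)buf)[off] = 1; /* first write faults the page in */
        touched++;
    }

    free(buf);
    return touched;
}
```

With overcommit enabled, a very large request may well succeed at the malloc call and only run into trouble inside the loop, which is the point at which the OOM killer can be invoked.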

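Linux has three overcommit policies (see the kernel documentation vm/overcommit-accounting), selected through /proc/sys/vm/overcommit_memory, which takes the values 0, 1 and 2; the default is 0.

(1) 0: heuristic policy. Blatant overcommits are refused, for example a sudden request for 128TB of memory, while mild overcommit is allowed. In addition, root is allowed to overcommit slightly more than ordinary users.

(2) 1: always allow overcommit. This policy suits applications that cannot tolerate allocation failures, such as some scientific computing applications.

(3) 2: never overcommit. In this mode the memory the system can commit will not exceed swap + RAM * ratio (/proc/sys/vm/overcommit_ratio, 50% by default and adjustable). Once that much has been committed, any further attempt to allocate memory returns an error, which usually means that no new program can be started.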

/proc/sys/vm # cat overcommit_ratio
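
The limit that applies under policy 2 follows directly from the formula swap + RAM * ratio. A minimal sketch of the arithmetic is shown below; commit_limit_kb is a made-up name, and the real kernel accounting has further details (for example around huge pages) that this sketch ignores.

```c
/*
 * Commit limit under overcommit_memory = 2, following the formula
 * swap + RAM * overcommit_ratio / 100. All sizes are in kB.
 * Hypothetical helper; it ignores the finer points of the kernel's
 * own accounting.
 */
unsigned long long commit_limit_kb(unsigned long long ram_kb,
                                   unsigned long long swap_kb,
                                   unsigned int ratio_percent)
{
    return swap_kb + ram_kb * ratio_percent / 100;
}
```

For a machine like the one in the dump above (260096 kB present, no swap) with the default ratio of 50, this gives 130048 kB.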

Of course, we can also change /proc/<pid>/oom_adj. Its default value is 0; when it is set to -17, the OOM killer will not select that process as a victim.

echo -17 > /proc/$(pidof sshd)/oom_adj

Why -17? This comes from the Linux implementation. The kernel's oom.h contains the following definitions:

/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
#define OOM_DISABLE (-17)
/* inclusive */
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15

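The oom_adj value therefore ranges from -16 to 15, and the larger it is, the more likely the process is to be killed. oom_score is a value computed with oom_adj as one of its inputs, and it is this score that determines which process gets killed.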
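
To make the effect of oom_adj concrete: as far as I understand the 2.6-era badness() function, the adjustment is applied as a bit shift on the score, so each step doubles or halves it. The sketch below models that behavior in user space; apply_oom_adj is a made-up name, and the clamping of out-of-range values is an assumption added to keep the shifts well defined.

```c
#define OOM_DISABLE    (-17)
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15

/*
 * User-space model of how oom_adj scales a badness score.
 * Positive values shift the score left (doubling per step), negative
 * values shift it right (halving per step), and OOM_DISABLE yields 0.
 * Hypothetical helper, not the kernel function itself.
 */
unsigned long apply_oom_adj(unsigned long points, int oom_adj)
{
    if (oom_adj == OOM_DISABLE)
        return 0;

    /* Assumption: clamp to the documented inclusive range. */
    if (oom_adj > OOM_ADJUST_MAX)
        oom_adj = OOM_ADJUST_MAX;
    if (oom_adj < OOM_ADJUST_MIN)
        oom_adj = OOM_ADJUST_MIN;

    if (oom_adj > 0) {
        if (points == 0)
            points = 1;         /* so that a zero score can still grow */
        points <<= oom_adj;
    } else if (oom_adj < 0) {
        points >>= -oom_adj;
    }
    return points;
}
```

For example, a base score of 1000 becomes 8000 with oom_adj set to 3 and 125 with oom_adj set to -3.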

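In summary, the analysis above shows that the OOM mechanism is triggered under the following conditions.

1) The VM cannot allocate any more pages. (Note that the Linux kernel allocates pages lazily, i.e. only when they are first used, so only malloc followed by memset actually consumes memory.)

2) The user address space is exhausted. This happens on 32-bit machines when user space exceeds 3GB, and is unlikely on 64-bit machines.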

2. Which processes are made to exit once the mechanism is triggered?

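As long as overcommit exists, the OOM killer may be invoked. Linux's selection policy has kept evolving, and we can influence the OOM killer's decision by setting certain values. Every process on Linux has an OOM weight in /proc/<pid>/oom_adj, ranging from -17 to +15; the higher the value, the more likely the process is to be killed. The final decision is based on /proc/<pid>/oom_score. This value is computed from the process's memory consumption, CPU time (utime + stime), lifetime (uptime - start time) and oom_adj: the more memory consumed, the higher the score; the longer the process has been alive, the lower the score. Overall, the policy is to lose the least amount of work, free the most memory, avoid hurting innocent processes that merely use a lot of memory, and kill as few processes as possible. In addition, when computing a process's memory consumption, Linux adds half of the memory used by each child process to the parent.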
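
The scoring rules just described can be sketched roughly as follows. This is a simplified user-space model, not the kernel's badness() function: struct task_sample and both function names are made up, the time fields are assumed to be pre-scaled, and the oom_adj step and any privilege-based adjustments are left out.

```c
/* Hypothetical stand-in for the fields badness() reads from a task. */
struct task_sample {
    unsigned long total_vm;    /* pages mapped by the task itself */
    unsigned long children_vm; /* total pages mapped by its children */
    unsigned long cpu_time;    /* utime + stime, already scaled down */
    unsigned long run_time;    /* uptime - start time, already scaled down */
};

/* Integer square root: largest r with r * r <= x. */
static unsigned long int_sqrt_ul(unsigned long x)
{
    unsigned long r = 0;

    while (r + 1 <= x / (r + 1))
        r++;
    return r;
}

/*
 * Base score: own memory plus half of the children's memory, reduced
 * for tasks that have used a lot of CPU time or have lived for long.
 */
unsigned long base_badness(const struct task_sample *t)
{
    unsigned long points = t->total_vm + t->children_vm / 2;
    unsigned long s;

    s = int_sqrt_ul(t->cpu_time);
    if (s)
        points /= s;

    s = int_sqrt_ul(int_sqrt_ul(t->run_time));
    if (s)
        points /= s;

    return points;
}
```

For a task with total_vm = 10000, children_vm = 2000, cpu_time = 16 and run_time = 256, the score is (10000 + 1000) / 4 / 4 = 687 with integer division. The square roots mean that long-running, CPU-heavy tasks are favored, but only mildly.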

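3. How does the kernel implement this?

The flowchart below shows the call relationships of out_of_memory.
[Flowchart: call graph of out_of_memory]

__out_of_memory does two main things: 1. it calls select_bad_process to choose the most suitable process to kill, and 2. it calls oom_kill_process to kill the chosen process.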

static void __out_of_memory(gfp_t gfp_mask, int order)
{
    struct task_struct *p;
    unsigned long points;

    /* If sysctl_oom_kill_allocating_task is set, kill the task that
     * requested the memory directly. */
    if (sysctl_oom_kill_allocating_task)
        if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
                "Out of memory (oom_kill_allocating_task)"))
            return;
retry:
    /*
     * Rambo mode: Shoot down a process and hope it solves whatever
     * issues we may have.
     */
    p = select_bad_process(&points, NULL);
    if (PTR_ERR(p) == -1UL)
        return;
    /* Found nothing?!?! Either we hang forever, or we panic. */
    if (!p) {
        read_unlock(&tasklist_lock);
        panic("Out of memory and no killable processes...\n");
    }
    if (oom_kill_process(p, gfp_mask, order, points, NULL,
                 "Out of memory"))
        goto retry;
}

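select_bad_process iterates over all processes, filters out those that do not qualify, and calls badness on the rest; the process with the highest score is chosen and then killed.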

static struct task_struct *select_bad_process(unsigned long *ppoints,
                        struct mem_cgroup *mem)
{
    struct task_struct *p;
    struct task_struct *chosen = NULL;
    struct timespec uptime;
    *ppoints = 0;

    do_posix_clock_monotonic_gettime(&uptime);
    /* walk every process, user tasks and kernel threads alike */
    for_each_process(p) {
        unsigned long points;

        /*
         * skip kernel threads and tasks which have already released
         * their mm.
         */
        if (!p->mm)
            continue;
        /* skip the init task */
        if (is_global_init(p))
            continue;
        if (mem && !task_in_mem_cgroup(p, mem))
            continue;
        if (test_tsk_thread_flag(p, TIF_MEMDIE))
            return ERR_PTR(-1UL);
        if (p->flags & PF_EXITING) {
            if (p != current)
                return ERR_PTR(-1UL);
            chosen = p;
            *ppoints = ULONG_MAX;
        }
        /* OOM_DISABLE is (-17), the value written to /proc/<pid>/oom_adj */
        if (p->signal->oom_adj == OOM_DISABLE)
            continue;
        /* score every remaining task with badness(); the highest score
         * is chosen */
        points = badness(p, uptime.tv_sec);
        if (points > *ppoints || !chosen) {
            chosen = p;
            *ppoints = points;
        }
    }
    return chosen;
}