Category: LINUX

2014-11-04 23:41:43

Original article: Analysis of the Linux Kernel OOM Mechanism, by frankzfz

I recently ran into the following problem at work:
    active_anon:16777 inactive_anon:13946 isolated_anon:0
 active_file:14 inactive_file:37 isolated_file:0
 unevictable:0 dirty:0 writeback:0 unstable:0
 free:2081 slab_reclaimable:299 slab_unreclaimable:26435
 mapped:53 shmem:171 pagetables:289 bounce:0
Normal free:8324kB min:2036kB low:2544kB high:3052kB active_anon:67108kB inactive_anon:55784kB active_file:56kB inactive_file:148kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:260096kB mlocked:0kB dirty:0kB writeback:0kB mapped:212kB shmem:684kB slab_reclaimable:1196kB slab_unreclaimable:105740kB kernel_stack:648kB pagetables:1156kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0
Normal: 655*4kB 663*8kB 17*16kB 4*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8324kB
222 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 0kB
Total swap = 0kB
65536 pages of RAM
2982 free pages
3251 reserved pages
26734 slab pages
709 pages shared
0 pages swap cached
Out of memory: kill process 6184 (XXX) score 166 or a child
Killed process 6184 (XXX)

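The log above shows that a memory shortage triggered the Linux kernel's OOM mechanism.

Linux has a feature called the OOM killer (Out of Memory killer). As the name suggests, it deals with running out of memory and comes into play when memory is exhausted. In the Linux 2.6 kernel, when this feature is enabled and memory runs out, the kernel scores the running processes, picks a suitable user-space process and kills it, freeing memory so that the system as a whole keeps running stably. The system log usually contains a line such as: Out of memory: kill process 959 (sshd) score 55 or a child.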

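1. When does OOM occur?

User-space programs usually allocate memory with malloc. Does malloc return NULL exactly when there is no more memory to hand out? It does not. The documentation of malloc's allocation behaviour contains the following passage: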

By default, Linux follows an optimistic memory allocation strategy. This means that when malloc() returns non-NULL there is no guarantee that the memory really is available. This is a really bad bug. In case it turns out that the system is out of memory, one or more processes will be killed by the infamous OOM killer. In case Linux is employed under circumstances where it would be less desirable to suddenly lose some randomly picked processes, and moreover the kernel version is sufficiently recent, one can switch off this overcommitting behavior using a command like:

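The passage above says that on Linux a non-NULL return value from malloc does not guarantee that the memory is actually available. Linux lets programs request more memory than the system has available; this is called overcommit. The rationale is presumably optimization: not every program uses memory as soon as it has requested it, and by the time it does, the system may have reclaimed some memory. If, however, no memory is available at the moment the program actually uses it, the OOM killer steps in and forces some processes to exit.

Linux has three overcommit policies (see the kernel documentation vm/overcommit-accounting), selected through /proc/sys/vm/overcommit_memory (the value is 0, 1 or 2; the default is 0).

(1) 0: heuristic policy. Blatant overcommits are refused, for example a sudden request for 128 TB of memory, while mild overcommit is allowed. In addition, root may overcommit slightly more than ordinary users.

(2) 1: always overcommit. This policy suits applications that cannot tolerate allocation failures, such as some scientific computing applications.

(3) 2: never overcommit. In this mode the memory the system can hand out never exceeds swap + RAM * ratio (the ratio is /proc/sys/vm/overcommit_ratio, 50% by default and adjustable). Once that much has been used up, every further attempt to allocate memory returns an error, which usually means that no new program can be started.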

/proc/sys/vm # cat overcommit_ratio

50
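As a rough illustration of the limit described for policy 2, the sketch below computes swap + RAM * ratio / 100. The function name commit_limit_kb and its parameters are made up for this example, and the kernel's real accounting handles further details (huge pages, for instance) that are ignored here.

```c
/*
 * Sketch only: commit limit under overcommit_memory = 2, using the
 * formula quoted in the text: swap + RAM * overcommit_ratio / 100.
 * All sizes are in kB. commit_limit_kb is a hypothetical helper,
 * not a kernel or libc function.
 */
unsigned long commit_limit_kb(unsigned long ram_kb, unsigned long swap_kb,
                              unsigned int ratio_percent)
{
    /* Widen before multiplying so the product cannot overflow on 32-bit. */
    unsigned long long scaled =
        (unsigned long long)ram_kb * ratio_percent / 100;

    return swap_kb + (unsigned long)scaled;
}
```

Taking the 65536 pages of RAM from the log above at 4 kB per page (262144 kB), with no swap and the default ratio of 50, commit_limit_kb(262144, 0, 50) gives 131072 kB.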

You can also change the value of /proc/<pid>/oom_adj. The default is 0; setting it to -17 exempts that process from the OOM mechanism, so it will not be killed.

echo -17 > /proc/$(pidof sshd)/oom_adj

Why -17? This follows from the Linux implementation. The kernel's oom.h contains the following definitions:

/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
#define OOM_DISABLE (-17)
/* inclusive */
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15

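The adjustable range of oom_adj is therefore -16 to 15, with -17 reserved for disabling the OOM killer. The larger the value, the more likely the process is to be killed. oom_score is a value computed with the help of oom_adj, and it is this score that decides which process gets killed.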
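The same adjustment can be made from a program. The following is a minimal sketch, assuming the legacy /proc/<pid>/oom_adj file discussed here is present; set_oom_adj is a hypothetical helper name, and lowering the value normally requires sufficient privileges.

```c
#include <stdio.h>
#include <sys/types.h>

#define OOM_DISABLE    (-17)   /* same values as in the kernel's oom.h */
#define OOM_ADJUST_MAX 15

/*
 * Sketch only: write `adj` to /proc/<pid>/oom_adj.
 * Returns 0 on success, -1 if adj is outside [-17, 15] or the file
 * cannot be opened or written. set_oom_adj is a hypothetical helper.
 */
int set_oom_adj(pid_t pid, int adj)
{
    char path[64];
    FILE *f;
    int ok;

    /* Reject values the kernel would not accept before touching /proc. */
    if (adj < OOM_DISABLE || adj > OOM_ADJUST_MAX)
        return -1;

    snprintf(path, sizeof(path), "/proc/%ld/oom_adj", (long)pid);
    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fprintf(f, "%d\n", adj) > 0;
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Calling set_oom_adj(pid, OOM_DISABLE) with the PID of sshd would correspond to the echo -17 command shown earlier.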

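In summary, the analysis above shows that the OOM mechanism kicks in when either of the following conditions holds:

1) The VM cannot allocate any more pages. (Note that the Linux kernel allocates pages lazily, that is, only when they are first used, so it takes malloc followed by memset to actually consume the memory.)

2) The user address space is exhausted. This happens on 32-bit machines when user space exceeds 3 GB; it is unlikely on 64-bit machines.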

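2. Which processes are killed once the mechanism is triggered?

As long as overcommit exists, the OOM killer may be invoked. The selection policy in Linux has kept evolving, and a few settings let us influence the OOM killer's decision. Every process has an OOM weight in /proc/<pid>/oom_adj, ranging from -17 to +15; the higher the value, the more likely the process is to be killed.

The OOM killer ultimately decides which process to kill based on /proc/<pid>/oom_score. The kernel computes this value from the process's memory consumption, its CPU time (utime + stime), its lifetime (uptime - start time) and oom_adj: the more memory a process consumes, the higher its score; the longer it has lived, the lower its score. The overall policy is to lose as little work as possible, free as much memory as possible, avoid killing innocent processes that merely use a lot of memory, and kill as few processes as possible.

In addition, when Linux computes a process's memory consumption, it adds half of the memory used by each child process to the parent.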

3. How does the kernel implement this?

[Figure: call graph of out_of_memory]

The __out_of_memory function does two things: (1) it calls select_bad_process to pick the best candidate to kill, and (2) it calls oom_kill_process to kill the chosen process.

static void __out_of_memory(gfp_t gfp_mask, int order)
{
    struct task_struct *p;
    unsigned long points;

    /* If sysctl_oom_kill_allocating_task is set, kill the allocating process directly. */
    if (sysctl_oom_kill_allocating_task)
        if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
                "Out of memory (oom_kill_allocating_task)"))
            return;
retry:
    /*
     * Rambo mode: Shoot down a process and hope it solves whatever
     * issues we may have.
     */
    p = select_bad_process(&points, NULL);

    if (PTR_ERR(p) == -1UL)
        return;

    /* Found nothing?!?! Either we hang forever, or we panic. */
    if (!p) {
        read_unlock(&tasklist_lock);
        panic("Out of memory and no killable processes...\n");
    }

    if (oom_kill_process(p, gfp_mask, order, points, NULL,
             "Out of memory"))
        goto retry;
}

The select_bad_process function iterates over all processes, filters out those that do not qualify, calls badness() on the rest, and returns the best candidate, which is then killed.

static struct task_struct *select_bad_process(unsigned long *ppoints, struct mem_cgroup *mem)
{
    struct task_struct *p;
    struct task_struct *chosen = NULL;
    struct timespec uptime;
    *ppoints = 0;

    do_posix_clock_monotonic_gettime(&uptime);
    for_each_process(p) { /* iterate over all processes, user processes and kernel threads alike */
        unsigned long points;

        /*
         * skip kernel threads and tasks which have already released
         * their mm.
         */
        if (!p->mm)
            continue;
        /* skip the init task */
        if (is_global_init(p))
            continue;
        if (mem && !task_in_mem_cgroup(p, mem))
            continue;

        if (test_tsk_thread_flag(p, TIF_MEMDIE))
            return ERR_PTR(-1UL);

        if (p->flags & PF_EXITING) {
            if (p != current)
                return ERR_PTR(-1UL);

            chosen = p;
            *ppoints = ULONG_MAX;
        }
        /* OOM_DISABLE is defined as (-17); this is the value read from /proc/<pid>/oom_adj */
        if (p->signal->oom_adj == OOM_DISABLE)
            continue;
        /* call badness() on every remaining process; the one with the highest score is chosen */
        points = badness(p, uptime.tv_sec);
        if (points > *ppoints || !chosen) {
            chosen = p;
            *ppoints = points;
        }
    }

    return chosen;
}
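The selection loop can be tried out in user space with a small model. The sketch below is not kernel code: struct task_sketch and select_bad_sketch are hypothetical stand-ins, the scores are assumed to be precomputed, and the TIF_MEMDIE, PF_EXITING and memory cgroup checks are left out.

```c
#include <stddef.h>

#define OOM_DISABLE (-17)

/* Hypothetical, heavily simplified stand-in for the kernel's task_struct. */
struct task_sketch {
    int pid;
    int has_mm;            /* 0 for kernel threads, which have no mm */
    int is_init;           /* 1 for the init process */
    int oom_adj;           /* value of /proc/<pid>/oom_adj */
    unsigned long points;  /* precomputed badness score */
};

/*
 * Sketch only: return the task with the highest score among those that
 * may be killed, or NULL if every task is filtered out.
 */
const struct task_sketch *select_bad_sketch(const struct task_sketch *tasks,
                                            size_t n)
{
    const struct task_sketch *chosen = NULL;
    unsigned long best = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        const struct task_sketch *p = &tasks[i];

        /* Same filters as the listing: kernel threads, init, OOM_DISABLE. */
        if (!p->has_mm || p->is_init || p->oom_adj == OOM_DISABLE)
            continue;

        if (p->points > best || !chosen) {
            chosen = p;
            best = p->points;
        }
    }
    return chosen;
}
```

A process protected with oom_adj = -17 is never returned, however large its score, which matches the behaviour described for sshd earlier.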

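The badness() function weighs several criteria to find the process that most deserves to be killed. The main rules are:

(1) The initial score is the process's total_vm.

(2) If the process has children, half of the total_vm that each child holds on its own is added to the parent's score.

(3) The score decreases as the process's cpu_time and run_time grow, so the longer a process has been running, the less likely it is to be killed.

(4) For processes with a nice value greater than 0, the score is doubled.

(5) The score is lowered for processes with superuser privileges and for processes with direct hardware access.

(6) If the process's allowed memory nodes do not overlap those of the current process, its score is lowered.

(7) Finally, the score is adjusted according to the process's oom_adj to obtain the final value.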

unsigned long badness(struct task_struct *p, unsigned long uptime)
{
    unsigned long points, cpu_time, run_time;
    struct mm_struct *mm;
    struct task_struct *child;
    int oom_adj = p->signal->oom_adj;
    struct task_cputime task_time;
    unsigned long utime;
    unsigned long stime;

    /* If OOM killing is disabled for this process, return immediately. */
    if (oom_adj == OOM_DISABLE)
        return 0;

    task_lock(p);
    mm = p->mm;
    if (!mm) {
        task_unlock(p);
        return 0;
    }

    /*
     * The memory size of the process is the basis for the badness.
     */
    points = mm->total_vm;

    /*
     * After this unlock we can no longer dereference local variable `mm'
     */
    task_unlock(p);

    /*
     * swapoff can easily use up all memory, so kill those first.
     */
    if (p->flags & PF_OOM_ORIGIN)
        return ULONG_MAX;

    list_for_each_entry(child, &p->children, sibling) {
        task_lock(child);
        /* For each child with its own mm, add half of the child's total_vm to points. */
        if (child->mm != mm && child->mm)
            points += child->mm->total_vm/2 + 1;
        task_unlock(child);
    }

    /*
     * CPU time is in tens of seconds and run time is in thousands
     * of seconds. There is no particular reason for this other than
     * that it turned out to work very well in practice.
     */
    thread_group_cputime(p, &task_time);
    utime = cputime_to_jiffies(task_time.utime);
    stime = cputime_to_jiffies(task_time.stime);
    cpu_time = (utime + stime) >> (SHIFT_HZ + 3);

    if (uptime >= p->start_time.tv_sec)
        run_time = (uptime - p->start_time.tv_sec) >> 10;
    else
        run_time = 0;

    /*
     * The score shrinks as cpu_time and run_time grow: the longer the
     * process has been running, the lower its score.
     */
    if (cpu_time)
        points /= int_sqrt(cpu_time);
    if (run_time)
        points /= int_sqrt(int_sqrt(run_time));

    /*
     * Niced processes are most likely less important, so double
     * their badness points. (nice normally ranges from -20 to +19;
     * a larger value means a lower priority.)
     */
    if (task_nice(p) > 0)
        points *= 2;

    /*
     * Superuser processes are usually more important, so we make it
     * less likely that we kill those.
     */
    if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
        has_capability_noaudit(p, CAP_SYS_RESOURCE))
        points /= 4;

    /*
     * We don't want to kill a process with direct hardware access.
     * Not only could that mess up the hardware, but usually users
     * tend to only have this flag set on applications they think
     * of as important.
     */
    if (has_capability_noaudit(p, CAP_SYS_RAWIO))
        points /= 4;

    /*
     * If p's nodes don't overlap ours, it may still help to kill p
     * because p may have allocated or otherwise mapped memory on
     * this node before. However it will be less likely.
     */
    if (!has_intersects_mems_allowed(p))
        points /= 8;

    /*
     * Adjust the score by oom_adj: shift left for positive values and
     * right for negative ones. The larger the final score, the more
     * likely the process is to be killed.
     */
    if (oom_adj) {
        if (oom_adj > 0) {
            if (!points)
                points = 1;
            points <<= oom_adj;
        } else
            points >>= -(oom_adj);
    }

#ifdef DEBUG
    printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
    p->pid, p->comm, points);
#endif
    return points;
}
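To get a feel for how these rules interact, the arithmetic can be replayed in user space. The sketch below is a simplified model under stated assumptions: badness_sketch and isqrt_sketch are hypothetical names, cpu_time and run_time are passed in already scaled as in the listing, and the child, PF_OOM_ORIGIN and memory-node rules are omitted.

```c
#define OOM_DISABLE (-17)

/* Floor of the square root, standing in for the kernel's int_sqrt(). */
static unsigned long isqrt_sketch(unsigned long x)
{
    unsigned long r = 0;

    /* (r + 1) <= x / (r + 1) is an overflow-safe form of (r + 1)^2 <= x. */
    while (r + 1 <= x / (r + 1))
        r++;
    return r;
}

/*
 * Sketch only: score a process from its total_vm (in pages), scaled
 * cpu_time and run_time, nice value, capability flags and oom_adj.
 */
unsigned long badness_sketch(unsigned long total_vm,
                             unsigned long cpu_time,
                             unsigned long run_time,
                             int nice, int privileged, int rawio,
                             int oom_adj)
{
    unsigned long points;

    if (oom_adj == OOM_DISABLE)
        return 0;

    points = total_vm;                 /* memory size is the basis */

    if (cpu_time)
        points /= isqrt_sketch(cpu_time);
    if (run_time)
        points /= isqrt_sketch(isqrt_sketch(run_time));

    if (nice > 0)                      /* niced processes count double */
        points *= 2;
    if (privileged)                    /* CAP_SYS_ADMIN or CAP_SYS_RESOURCE */
        points /= 4;
    if (rawio)                         /* CAP_SYS_RAWIO */
        points /= 4;

    if (oom_adj > 0) {
        if (!points)
            points = 1;
        points <<= oom_adj;
    } else if (oom_adj < 0) {
        points >>= -oom_adj;
    }
    return points;
}
```

For example, a process with total_vm of 10000 pages, scaled cpu_time 16 and scaled run_time 256 scores 10000 / 4 / 4 = 625; the same process with oom_adj = 3 scores 625 << 3 = 5000.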

Once a process has been selected, the following function is called to send it SIGKILL, which is equivalent to a user running kill -9 <pid>.

static void __oom_kill_task(struct task_struct *p, int verbose)
{
    if (is_global_init(p)) {
        WARN_ON(1);
        printk(KERN_WARNING "tried to kill init!\n");
        return;
    }

    if (!p->mm) {
        WARN_ON(1);
        printk(KERN_WARNING "tried to kill an mm-less task!\n");
        return;
    }

    if (verbose)
        printk(KERN_ERR "Killed process %d (%s)\n",
                task_pid_nr(p), p->comm);

    p->rt.time_slice = HZ;
    set_tsk_thread_flag(p, TIF_MEMDIE);

    force_sig(SIGKILL, p);
}

References:

http://www.linuxdevcenter.com/pub/a/linux/2006/11/30/linux-out-of-memory.html

http://blog.sae.sina.com.cn/archives/2259
