Category: LINUX

2009-04-19 00:50:57

This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.
 
Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.


 
Unfortunately that was not that easy to implement under the old
scheduler, (otherwise we'd have done it long ago) because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.
 
 

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!):


                   A
             \     | [timeslice length]
              \    |
               \   |
                \  |
                 \ |
                  \|___100msecs
                   |^ . _
                   |      ^ . _
                   |            ^ . _
 -*----------------------------------*-----> [nice level]
 -20               |                +19
                   |
                   |

 
So that if someone wanted to really renice tasks, +19 would give a much
bigger hit than the normal linear rule would do. (The solution of
changing the ABI to extend priorities was discarded early on.)


This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
we felt to be a bit excessive. Excessive _not_ because it's too small of
a CPU utilization, but because it causes too frequent (once per
millisec) rescheduling. (and would thus trash the cache, etc. Remember,
this was long ago when hardware was weaker and caches were smaller, and
people were running number crunching apps at nice +19.)



So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
right minimal granularity - and this translates to 5% CPU utilization.
But the fundamental HZ-sensitive property for nice+19 still remained,
and we never got a single complaint about nice +19 being too _weak_ in
terms of CPU utilization, we only got complaints about it (still) being
too _strong_ :-)
 


To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies and their nasty design level
coupling to timeslices and granularity it was not really viable.
 

 
The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or, more
accurately, the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative":


   int nice(int inc);
      asmlinkage long sys_nice(int increment)

 
(the first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
bash's "nice" command mirror this relative API.



With the old scheduler, if you for example started a niced task with +1
and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell - if it was at nice -10 the
CPU split was different than if it was at +5 or +10.
 

 
A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy enough', so lots of people had to resort to
run audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
proof, and a buggy SCHED_FIFO app can also lock up the system for good.


The new scheduler in v2.6.23 addresses all three types of complaints:

 

To address the first complaint (of nice levels being not "punchy"
enough), the scheduler was decoupled from 'time slice' and HZ concepts
(and granularity was made a separate concept from nice levels) and thus
it was possible to implement better and more consistent nice +19
support: with the new scheduler nice +19 tasks get a HZ-independent
1.5%, instead of the variable 3%-5%-9% range they got in the old
scheduler.



To address the second complaint (of nice levels not being consistent),
the new scheduler makes nice(1) have the same CPU utilization effect on
tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task. (one will get 55% of the CPU, the other 45%.) That is why nice
levels were changed to be "multiplicative" (or exponential) - that way
it does not matter which nice level you start out from, the 'relative
result' will always be the same.




The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.
 