Category: LINUX

2013-07-07 16:28:57

Plenty of earlier posts online already analyze the O(1) and CFS schedulers in detail, and there is little point in repeating them. Instead, this article works through the source code to explore the essential differences between the O(1) scheduler and the CFS scheduler.

First, to be clear: both the O(1) scheduler and CFS schedule non-real-time processes only; neither handles real-time processes. When real-time and conventional processes are runnable at the same time, the kernel always favors the real-time ones and schedules them first. This is visible in schedule(), which calls:

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
        const struct sched_class *class;
        struct task_struct *p;

        /*
         * Optimization: we know that if all tasks are in
         * the fair class we can call that function directly:
         */
        if (likely(rq->nr_running == rq->cfs.nr_running)) { /* usually the system has no real-time tasks, hence this fast path */
                p = fair_sched_class.pick_next_task(rq);
                if (likely(p))
                        return p;
        }

        class = sched_class_highest; /* the kernel always favors real-time tasks */
        for ( ; ; ) {
                p = class->pick_next_task(rq);
                if (p)
                        return p;
                /*
                 * Will never be NULL as the idle class always
                 * returns a non-NULL p:
                 */
                class = class->next;
        }
}

sched_class_highest is defined as follows:

#define sched_class_highest (&rt_sched_class)

 

In other words, as long as there is a real-time process on the run queue, it gets scheduled; no matter how many conventional processes exist, none of them will run. At that point it makes no difference whether the scheduler is O(1) or CFS. Normally, however, the system has no real-time processes, which is exactly why the code above takes the fast path. Accordingly, the rest of this article assumes a system containing only conventional processes, with no real-time ones.

With that premise set, consider what a scheduler is required to do: let multiple runnable processes take turns on the CPU, producing the illusion that several programs run simultaneously. In short, serial at the micro level, parallel at the macro level. In addition, since processes have different priorities, the kernel must schedule higher-priority processes more; and it must respond quickly to I/O-bound processes, or the user experience suffers. With these design goals in mind, let us see how the two schedulers achieve them.

Start with the O(1) scheduler. To favor processes of higher priority, the O(1) scheduler maps a process's priority to a timeslice; put simply, a higher priority buys you more time. The code is as follows:

/*
 * task_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
 * to time slice values: [800ms ... 100ms ... 5ms]
 *
 * The higher a thread's priority, the bigger timeslices
 * it gets during one round of execution. But even the lowest
 * priority thread gets MIN_TIMESLICE worth of execution time.
 */

#define SCALE_PRIO(x, prio) \
        max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO/2), MIN_TIMESLICE)

static unsigned int task_timeslice(task_t *p)
{
        if (p->static_prio < NICE_TO_PRIO(0))
                return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
        else
                return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
}

Note that the timeslice granted here is measured in ticks: task_timeslice() returns the number of ticks the process may run on the CPU the next time it is scheduled, and once that many ticks elapse, the process is preempted. To favor I/O-bound processes, however, O(1) rewards them:

 

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 *
 * It also gets called by the fork code, when changing the parent's
 * timeslices.
 */
void scheduler_tick(void)
{
        int cpu = smp_processor_id();
        runqueue_t *rq = this_rq();
        task_t *p = current;

        rq->timestamp_last_tick = sched_clock();

        if (p == rq->idle) {
                if (wake_priority_sleeper(rq))
                        goto out;
                rebalance_tick(cpu, rq, SCHED_IDLE);
                return;
        }

        /* Task might have expired already, but not scheduled off yet */
        if (p->array != rq->active) {
                set_tsk_need_resched(p);
                goto out;
        }
        spin_lock(&rq->lock);
        /*
         * The task was running during this tick - update the
         * time slice counter. Note: we do not update a thread's
         * priority until it either goes to sleep or uses up its
         * timeslice. This makes it possible for interactive tasks
         * to use up their timeslices at their highest priority levels.
         */
        if (rt_task(p)) {
                /*
                 * RR tasks need a special form of timeslice management.
                 * FIFO tasks have no timeslices.
                 */
                if ((p->policy == SCHED_RR) && !--p->time_slice) {
                        p->time_slice = task_timeslice(p);
                        p->first_time_slice = 0;
                        set_tsk_need_resched(p);

                        /* put it at the end of the queue: */
                        requeue_task(p, rq->active);
                }
                goto out_unlock;
        }
        if (!--p->time_slice) {
                dequeue_task(p, rq->active);
                set_tsk_need_resched(p);
                p->prio = effective_prio(p);
                p->time_slice = task_timeslice(p);
                p->first_time_slice = 0;

                if (!rq->expired_timestamp)
                        rq->expired_timestamp = jiffies;
                if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
                        enqueue_task(p, rq->expired);
                        if (p->static_prio < rq->best_expired_prio)
                                rq->best_expired_prio = p->static_prio;
                } else
                        enqueue_task(p, rq->active); /* to favor I/O-bound tasks: when the timeslice expires, an interactive task is, under certain conditions, put back on the active array, artificially extending its run time */
        } else {
                /*
                 * Prevent a too long timeslice allowing a task to monopolize
                 * the CPU. We do this by splitting up the timeslice into
                 * smaller pieces.
                 *
                 * Note: this does not mean the task's timeslices expire or
                 * get lost in any way, they just might be preempted by
                 * another task of equal priority. (one with higher
                 * priority would have preempted this task already.) We
                 * requeue this task to the end of the list on this priority
                 * level, which is in essence a round-robin of tasks with
                 * equal priority.
                 *
                 * This only applies to tasks in the interactive
                 * delta range with at least TIMESLICE_GRANULARITY to requeue.
                 */
                if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
                        p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
                        (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
                        (p->array == rq->active)) {

                        requeue_task(p, rq->active);
                        set_tsk_need_resched(p);
                }
        }
out_unlock:
        spin_unlock(&rq->lock);
out:
        rebalance_tick(cpu, rq, NOT_IDLE);
}

 

Now turn to the CFS algorithm, again starting from scheduler_tick():

 

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 *
 * It also gets called by the fork code, when changing the parent's
 * timeslices.
 */
void scheduler_tick(void)
{
        int cpu = smp_processor_id();
        struct rq *rq = cpu_rq(cpu);
        struct task_struct *curr = rq->curr;
        u64 next_tick = rq->tick_timestamp + TICK_NSEC;

        spin_lock(&rq->lock);
        __update_rq_clock(rq);
        /*
         * Let rq->clock advance by at least TICK_NSEC:
         */
        if (unlikely(rq->clock < next_tick)) {
                rq->clock = next_tick;
                rq->clock_underflows++;
        }
        rq->tick_timestamp = rq->clock;
        update_cpu_load(rq);
        curr->sched_class->task_tick(rq, curr, 0);
        update_sched_rt_period(rq);
        spin_unlock(&rq->lock);

#ifdef CONFIG_SMP
        rq->idle_at_tick = idle_cpu(cpu);
        trigger_load_balance(rq, cpu);
#endif
}

This function first updates the run queue's clock, then calls curr->sched_class->task_tick(rq, curr, 0). For CFS, that hook is:

 

/*
 * scheduler tick hitting a task of our scheduling class:
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &curr->se;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                entity_tick(cfs_rq, se, queued);
        }
}

The key call here is entity_tick(cfs_rq, se, queued):

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
        /*
         * Update run-time statistics of the 'current'.
         */
        update_curr(cfs_rq);

#ifdef CONFIG_SCHED_HRTICK
        /*
         * queued ticks are scheduled to match the slice, so don't bother
         * validating it and just reschedule.
         */
        if (queued)
                return resched_task(rq_of(cfs_rq)->curr);
        /*
         * don't let the period tick interfere with the hrtick preemption
         */
        if (!sched_feat(DOUBLE_TICK) &&
                        hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
                return;
#endif

        if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
                check_preempt_tick(cfs_rq, curr);
}

entity_tick() first updates the virtual runtime of the currently running process, then calls check_preempt_tick() to decide whether a reschedule is needed:

 

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
        unsigned long ideal_runtime, delta_exec;

        ideal_runtime = sched_slice(cfs_rq, curr);
        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        if (delta_exec > ideal_runtime)
                resched_task(rq_of(cfs_rq)->curr);
}

 

As you can see, CFS does not care at all whether a process is I/O-bound or CPU-bound. It simply computes the ideal amount of time the process should run, and if the process has already run longer than that ideal time, it triggers a reschedule at the next opportunity. Processes of different priorities carry different weights:

 

static void set_load_weight(struct task_struct *p)
{
        if (task_has_rt_policy(p)) {
                p->se.load.weight = prio_to_weight[0] * 2;
                p->se.load.inv_weight = prio_to_wmult[0] >> 1;
                return;
        }

        /*
         * SCHED_IDLE tasks get minimal weight:
         */
        if (p->policy == SCHED_IDLE) {
                p->se.load.weight = WEIGHT_IDLEPRIO;
                p->se.load.inv_weight = WMULT_IDLEPRIO;
                return;
        }

        p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO]; /* prio_to_weight[] maps static priority to weight */
        p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
}

Processes with different weights receive different shares of one targeted latency period: a higher-priority process has a larger weight, so its ideal_time is larger and it runs longer within each period. That is the core of the CFS algorithm. Note also that ideal_time here is measured in nanoseconds rather than ticks; CFS's test for whether a process should be preempted is completely decoupled from the tick.

Finally, let us summarize the similarities and differences between O(1) and CFS, and their respective strengths and weaknesses. O(1) allocates timeslices according to a process's dynamic priority, with the timeslice measured in ticks; this is how it favors high-static-priority and I/O-bound processes, and it relies on a set of fairly complex heuristic formulas to identify the I/O-bound ones. CFS, by contrast, uses a process's static priority as its weight and allocates CPU power accordingly. CFS has no need to identify I/O-bound processes: such processes spend most of their time sleeping, so they naturally consume little CPU power, and when they wake up they are naturally easy to schedule.

Moreover, because the O(1) algorithm maps priority directly to timeslice length, high-priority processes get longer slices, which produces an odd effect: between two processes of equal priority, the scheduling interval is longer at high priority than at low priority. For example, say a nice -10 process gets a 100-tick timeslice and a nice 10 process gets a 10-tick one. Two nice -10 processes running together switch every 100 ticks, while two nice 10 processes switch every 10 ticks; yet the high-nice processes are often the I/O-bound ones that ought to switch more quickly. Compare CFS: assuming a target_latency of 80 ticks' worth of time, any two processes with the same nice value, whether that value is high or low, each receive target_latency/2, i.e. 50% of the CPU power.

Similarly, under O(1) the effect of adding or subtracting one nice level depends on where you are on the scale. For example, nice 0 and nice 1 yield timeslices of 100 and 95, barely different, while nice 18 and nice 19 yield 10 and 5, a factor of two. Under CFS, each additional nice step changes a process's share by a constant ratio.

Furthermore, O(1) allocates timeslices in ticks, so the same nice value can map to different amounts of wall-clock time on machines with different tick rates. CFS always works in nanoseconds, so no such problem exists.

Finally, as noted at the start, O(1) favors I/O-bound processes by artificially granting them extra timeslice. In certain situations this lets an I/O-bound process occupy more time than it should. CFS never gives any process extra timeslice.

All of the issues above really come down to a difference in design philosophy. O(1) implements an absolute mapping from process priority to timeslice, whereas CFS maps priority to a share of CPU power. As a result, O(1) keeps switching intervals relatively constant but cannot achieve relative fairness in how processes share the CPU. CFS first fixes a targeted_latency and then divides up each process's timeslice (really, its share of CPU power) within it; switching intervals are no longer constant, but the processes are fair relative to one another.

Overall, CFS is cleanly designed, free of elaborate heuristics, and avoids many of the problems O(1) brought with it. Its underlying data structure is a red-black tree; O(log n) operations cannot match O(1) lookups, but the performance is more than adequate. With its simple and robust architecture, CFS ultimately replaced O(1) as the scheduler of the modern Linux kernel.

