分类: LINUX
2013-09-05 00:15:23
原文地址:CFS调度器与O(1)调度器的比较 作者:longddr
关于O(1)调度器或者CFS调度器的分析,网上已经有不少前辈的博文做了详细的介绍,再加以赘述也没啥必要,因此,本文想结合源码详细探讨一下O1调度器和CFS调度器本质上的区别。
首先明确一下,O1调度器和CFS都是属于非实时进程的调度器,实时进程他们是不管用的。当有实时进程和传统进程同时处于可运行状态时,实时进程总是得到内核的偏爱,最先得到调度:在schedule()中会调用
/*
* Pick up the highest-prio task:
*/
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
const struct sched_class *class;
struct task_struct *p;
/*
* Optimization: we know that if all tasks are in
* the fair class we can call that function directly:
*/
if (likely(rq->nr_running == rq->cfs.nr_running)) {//大多数情况下,系统中没有实时进程,此处因此做了优化
p = fair_sched_class.pick_next_task(rq);
if (likely(p))
return p;
}
class = sched_class_highest;//内核总是偏爱实时进程,
for ( ; ; ) {
p = class->pick_next_task(rq);
if (p)
return p;
/*
* Will never be NULL as the idle class always
* returns a non-NULL p:
*/
class = class->next;
}
}
sched_class_highest定义如下:
#define sched_class_highest (&rt_sched_class)
这就是说只要运行队列中有实时进程,那就调度它,不管你有多少传统进程,都别想得到调度。此时神马O1,CFS都是浮云。但正常情况下,系统中是没有实时进程的,这也是上面代码做了优化的原因。因此,本文讨论的前提也是假定系统中只有传统进程,没有实时进程。
设定了前提之后,再说说调度器设计的要求:使得多个处于可运行状态的进程轮流在cpu上执行,从而产生几个程序同时运行的假象。说白了就是微观串行,宏观并行,另外,由于每个进程有不同的优先级,所以,内核需要让优先级高的进程更多的得到调度。除此之外,还需要对于IO-bound的进程提供快速响应的机制,否则影响用户体验。有了设计目标,我们再来看看这两种调度器是怎么实现这一目标的。
首先先来看看O1调度器,为了实现内核对于不同优先级的偏爱,O1调度器将进程的优先级map成时间片,说白了,就是如果优先级高,就多给点时间给你,代码如下:
/*
* task_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]
* to time slice values: [800ms ... 100ms ... 5ms]
*
* The higher a thread's priority, the bigger timeslices
* it gets during one round of execution. But even the lowest
* priority thread gets MIN_TIMESLICE worth of execution time.
*/
#define SCALE_PRIO(x, prio) \
max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO/2), MIN_TIMESLICE)
static unsigned int task_timeslice(task_t *p)
{
if (p->static_prio < NICE_TO_PRIO(0))
return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
else
return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
}
注意,此处赋予进程的时间片是以tick为单位,也就是说task_timeslice这个函数返回的是此进程下一次得到调度能够在cpu上运行的tick数,超过此tick数,进程就会被抢占。但是,O1为了偏爱I/O bound的进程,会对它进行奖励:
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
* It also gets called by the fork code, when changing the parent's
* timeslices.
*/
void scheduler_tick(void)
{
int cpu = smp_processor_id();
runqueue_t *rq = this_rq();
task_t *p = current;
rq->timestamp_last_tick = sched_clock();
if (p == rq->idle) {
if (wake_priority_sleeper(rq))
goto out;
rebalance_tick(cpu, rq, SCHED_IDLE);
return;
}
/* Task might have expired already, but not scheduled off yet */
if (p->array != rq->active) {
set_tsk_need_resched(p);
goto out;
}
spin_lock(&rq->lock);
/*
* The task was running during this tick - update the
* time slice counter. Note: we do not update a thread's
* priority until it either goes to sleep or uses up its
* timeslice. This makes it possible for interactive tasks
* to use up their timeslices at their highest priority levels.
*/
if (rt_task(p)) {
/*
* RR tasks need a special form of timeslice management.
* FIFO tasks have no timeslices.
*/
if ((p->policy == SCHED_RR) && !--p->time_slice) {
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
set_tsk_need_resched(p);
/* put it at the end of the queue: */
requeue_task(p, rq->active);
}
goto out_unlock;
}
if (!--p->time_slice) {
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
enqueue_task(p, rq->expired);
if (p->static_prio < rq->best_expired_prio)
rq->best_expired_prio = p->static_prio;
} else
enqueue_task(p, rq->active);//为了表示对I/O bound的进程的偏爱,当时间片耗完时,在满足一定条件下,人为增加进程的运行时间
} else {
/*
* Prevent a too long timeslice allowing a task to monopolize
* the CPU. We do this by splitting up the timeslice into
* smaller pieces.
*
* Note: this does not mean the task's timeslices expire or
* get lost in any way, they just might be preempted by
* another task of equal priority. (one with higher
* priority would have preempted this task already.) We
* requeue this task to the end of the list on this priority
* level, which is in essence a round-robin of tasks with
* equal priority.
*
* This only applies to tasks in the interactive
* delta range with at least TIMESLICE_GRANULARITY to requeue.
*/
if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
(p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
(p->array == rq->active)) {
requeue_task(p, rq->active);
set_tsk_need_resched(p);
}
}
out_unlock:
spin_unlock(&rq->lock);
out:
rebalance_tick(cpu, rq, NOT_IDLE);
}
再分析一下CFS的调度算法,同样关注一下scheduler_tick函数:
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
* It also gets called by the fork code, when changing the parent's
* timeslices.
*/
void scheduler_tick(void)
{
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
struct task_struct *curr = rq->curr;
u64 next_tick = rq->tick_timestamp + TICK_NSEC;
spin_lock(&rq->lock);
__update_rq_clock(rq);
/*
* Let rq->clock advance by at least TICK_NSEC:
*/
if (unlikely(rq->clock < next_tick)) {
rq->clock = next_tick;
rq->clock_underflows++;
}
rq->tick_timestamp = rq->clock;
update_cpu_load(rq);
curr->sched_class->task_tick(rq, curr, 0);
update_sched_rt_period(rq);
spin_unlock(&rq->lock);
#ifdef CONFIG_SMP
rq->idle_at_tick = idle_cpu(cpu);
trigger_load_balance(rq, cpu);
#endif
}
此函数里首先更新一下runqueue中的clock,然后调用curr->sched_class->task_tick(rq, curr, 0);在cfs中此函数为
/*
* scheduler tick hitting a task of our scheduling class:
*/
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
entity_tick(cfs_rq, se, queued);
}
}
此函数重点在entity_tick(cfs_rq, se, queued);
:
static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
/*
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
#ifdef CONFIG_SCHED_HRTICK
/*
* queued ticks are scheduled to match the slice, so don't bother
* validating it and just reschedule.
*/
if (queued)
return resched_task(rq_of(cfs_rq)->curr);
/*
* don't let the period tick interfere with the hrtick preemption
*/
if (!sched_feat(DOUBLE_TICK) &&
hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
return;
#endif
if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
check_preempt_tick(cfs_rq, curr);
}
entity_tick中首先更新一下当前运行的进程的虚拟运行时间,然后调用check_preempt_tick来检查是否需要进行一次schedule的调度,
:
/*
* Preempt the current task with a newly woken task if needed:
*/
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
unsigned long ideal_runtime, delta_exec;
ideal_runtime = sched_slice(cfs_rq, curr);
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
if (delta_exec > ideal_runtime)
resched_task(rq_of(cfs_rq)->curr);
}
由此可见,CFS根本不管什么所谓的I/O bound进程或者是CPU bound进程,它只是简单的计算一下你应该运行的理想的时间片,如果你运行的时间已经大于理想运行的时间,ok,那就在下个点触发一次调度。对于不同优先级的进程,它的权重是不一样的:
static void set_load_weight(struct task_struct *p)
{
if (task_has_rt_policy(p)) {
p->se.load.weight = prio_to_weight[0] * 2;
p->se.load.inv_weight = prio_to_wmult[0] >> 1;
return;
}
/*
* SCHED_IDLE tasks get minimal weight:
*/
if (p->policy == SCHED_IDLE) {
p->se.load.weight = WEIGHT_IDLEPRIO;
p->se.load.inv_weight = WMULT_IDLEPRIO;
return;
}
p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];//用来map权重的数组
p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
}
不同权重的进程,在一个targeted_latency中分得的时间片也不一样,优先级高的进程,权重就高,获得的ideal_time也多,这样在一个targeted_latency中运行的时间也多,这就是CFS的算法。另外需要注意的是,此处ideal_time的单位不再是tick,而是ns。也就是说,CFS度量一个进程是否需要被抢占已经完全脱离了tick的束缚,与tick无关。
最后,来总结下O1和CFS的异同,以及他们各自的优缺点。首先,O1是通过按照进程的动态优先级来分配时间片,而且此时间片是以tick为单位,以这样一种方式,来达到偏爱(静态优先级高以及I/O bound)的进程,它在识别I/Obound的进程时使用了一系列复杂公式。而CFS是以进程的静态优先级作为它的权重,来分配CPU的power。CFS无需识别I/O bound的进程,因为这样的进程通常在睡眠,所以,它使用的CPU的power自然就少,当它被唤醒时,自然容易得到调度。
另外,由于O1调度算法将进程的优先级直接映射到时间片,这就导致一个问题,因为高优先级的进程得到的时间片多,这就使得两个同等优先级的进程处于刻运行状态时,反而高优先级的进程调度的间隔小于低优先级。比如,nice值-10的进程时间片为100个tick,nice值为10的进程时间片为10个tick,然而当两个nice值为-10的进程同时运行时,进程切换时间为100个tick,而nice值为10的切换时间却为10个tick。但是,nice值高的往往属于I/O bound的进程,本应该更短的切换。反观CFS,假设target_latency为80个tick,只要是nice值相同的两个进程,无论他们优先级是高还是低,都能分得target_latency/2的时间片,即50%的CPU power。
同样的,对于O1而言,在不同的nice值上加一或减一,得到的效果不同。例如,nice值0和1分别为100和95,两者相差并不多,而nice值为18和19的分别为10和5,相差近一倍。而CFS,nice值增加的部分对应的比例总是恒定的。
还有,O1分配时间片时以tick为单位,这样同样的nice值,在不同的设备上map的时间就可能不一样。而CFS始终以ns为单位,因此,不存在这样的问题。
最后,上面提到O1为了偏爱I/O bound的进程,如最开始所讲,会人为赋予这类进程更多的时间片。这其实在特定情况下,也会使I/O bound的进程占用太多本不该占用的时间。而CFS并不会多给任何进程时间片。
其实,上面所有的问题,都归根于设计理念的不同。O1实现的是进程的优先级对于时间片的绝对映射,而CFS实现的是优先级对于power的映射。因此,O1进程切换时间相对恒定而达不到进程间分享cpu的相对公平。而CFS先划定一个targeted_latency,在此基础上,分配各个进程的时间片(实际上是cpu的power),这样虽然切换时间不恒定,但达到进程之间的相对公平。
总体上来看,CFS设计简洁,没有太多的启发式复杂的算法,而且避免了O1之前带来的很多问题。而且因为它的数据接口使用的是red-black tree,虽然比不上o1,但log(n)在性能上也够用。所以,CFS最终以它简单而健壮的架构,取代O1,成为新一代linux kernel的调度器。