Chinaunix首页 | 论坛 | 博客
  • 博客访问: 1855521
  • 博文数量: 317
  • 博客积分: 1557
  • 博客等级: 上尉
  • 技术积分: 1208
  • 用 户 组: 普通用户
  • 注册时间: 2008-02-26 23:38
个人简介

如果想出发,就不要等到明天!

文章分类

全部博文(317)

文章存档

2016年(1)

2015年(41)

2014年(152)

2013年(114)

2012年(4)

2011年(1)

2009年(4)

分类: LINUX

2013-07-08 09:54:09

关于O(1)调度器或者CFS调度器的分析,网上已经有不少前辈的博文做了详细的介绍,再加以赘述也没啥必要,因此,本文想结合源码详细探讨一下O1调度器和CFS调度器本质上的区别。

首先明确一下,O1调度器和CFS都是属于非实时进程的调度器,实时进程他们是不管用的。当有实时进程和传统进程同时处于可运行状态时,实时进程总是得到内核的偏爱,最先得到调度:在schedule()中会调用

/*

 * Pick up the highest-prio task:

 */

static inline struct task_struct *

pick_next_task(struct rq *rq, struct task_struct *prev)

{

         const struct sched_class *class;

         struct task_struct *p;

 

         /*

          * Optimization: we know that if all tasks are in

          * the fair class we can call that function directly:

          */

         if (likely(rq->nr_running == rq->cfs.nr_running)) {//大多数情况下,系统中没有实时进程,此处因此做了优化

                   p = fair_sched_class.pick_next_task(rq);

                   if (likely(p))

                            return p;

         }

 

         class = sched_class_highest;//内核总是偏爱实时进程,

         for ( ; ; ) {

                   p = class->pick_next_task(rq);

                   if (p)

                            return p;

                   /*

                    * Will never be NULL as the idle class always

                    * returns a non-NULL p:

                    */

                   class = class->next;

         }

}

sched_class_highest定义如下:

#define sched_class_highest (&rt_sched_class)

 

这就是说只要运行队列中有实时进程,那就调度它,不管你有多少传统进程,都别想得到调度。此时神马O1CFS都是浮云。但正常情况下,系统中是没有实时进程的,这也是上面代码做了优化的原因。因此,本文讨论的前提也是假定系统中只有传统进程,没有实时进程。

设定了前提之后,再说说调度器设计的要求:使得多个处于可运行状态的进程轮流在cpu上执行,从而产生几个程序同时运行的假象。说白了就是微观串行,宏观并行,另外,由于每个进程有不同的优先级,所以,内核需要让优先级高的进程更多的得到调度。除此之外,还需要对于IO-bound的进程提供快速响应的机制,否则影响用户体验。有了设计目标,我们再来看看这两种调度器是怎么实现这一目标的。

首先先来看看O1调度器,为了实现内核对于不同优先级的偏爱,O1调度器将进程的优先级map成时间片,说白了,就是如果优先级高,就多给点时间给你,代码如下:

/*

 * task_timeslice() scales user-nice values [ -20 ... 0 ... 19 ]

 * to time slice values: [800ms ... 100ms ... 5ms]

 *

 * The higher a thread's priority, the bigger timeslices

 * it gets during one round of execution. But even the lowest

 * priority thread gets MIN_TIMESLICE worth of execution time.

 */

 

#define SCALE_PRIO(x, prio) \

         max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO/2), MIN_TIMESLICE)

 

static unsigned int task_timeslice(task_t *p)

{

         if (p->static_prio < NICE_TO_PRIO(0))

                   return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);

         else

                   return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);

}

注意,此处赋予进程的时间片是以tick为单位,也就是说task_timeslice这个函数返回的是此进程下一次得到调度能够在cpu上运行的tick数,超过此tick数,进程就会被抢占。但是,O1为了偏爱I/O bound的进程,会对它进行奖励:

 

/*

 * This function gets called by the timer code, with HZ frequency.

 * We call it with interrupts disabled.

 *

 * It also gets called by the fork code, when changing the parent's

 * timeslices.

 */

void scheduler_tick(void)

{

         int cpu = smp_processor_id();

         runqueue_t *rq = this_rq();

         task_t *p = current;

 

         rq->timestamp_last_tick = sched_clock();

 

         if (p == rq->idle) {

                   if (wake_priority_sleeper(rq))

                            goto out;

                   rebalance_tick(cpu, rq, SCHED_IDLE);

                   return;

         }

 

         /* Task might have expired already, but not scheduled off yet */

         if (p->array != rq->active) {

                   set_tsk_need_resched(p);

                   goto out;

         }

         spin_lock(&rq->lock);

         /*

          * The task was running during this tick - update the

          * time slice counter. Note: we do not update a thread's

          * priority until it either goes to sleep or uses up its

          * timeslice. This makes it possible for interactive tasks

          * to use up their timeslices at their highest priority levels.

          */

         if (rt_task(p)) {

                   /*

                    * RR tasks need a special form of timeslice management.

                    * FIFO tasks have no timeslices.

                    */

                   if ((p->policy == SCHED_RR) && !--p->time_slice) {

                            p->time_slice = task_timeslice(p);

                            p->first_time_slice = 0;

                            set_tsk_need_resched(p);

 

                            /* put it at the end of the queue: */

                            requeue_task(p, rq->active);

                   }

                   goto out_unlock;

         }

         if (!--p->time_slice) {

                   dequeue_task(p, rq->active);

                   set_tsk_need_resched(p);

                   p->prio = effective_prio(p);

                   p->time_slice = task_timeslice(p);

                   p->first_time_slice = 0;

 

                   if (!rq->expired_timestamp)

                            rq->expired_timestamp = jiffies;

                   if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {

                            enqueue_task(p, rq->expired);

                            if (p->static_prio < rq->best_expired_prio)

                                     rq->best_expired_prio = p->static_prio;

                   } else

                            enqueue_task(p, rq->active);//为了表示对I/O bound的进程的偏爱,当时间片耗完时,在满足一定条件下,人为增加进程的运行时间

         } else {

                   /*

                    * Prevent a too long timeslice allowing a task to monopolize

                    * the CPU. We do this by splitting up the timeslice into

                    * smaller pieces.

                    *

                    * Note: this does not mean the task's timeslices expire or

                    * get lost in any way, they just might be preempted by

                    * another task of equal priority. (one with higher

                    * priority would have preempted this task already.) We

                    * requeue this task to the end of the list on this priority

                    * level, which is in essence a round-robin of tasks with

                    * equal priority.

                    *

                    * This only applies to tasks in the interactive

                    * delta range with at least TIMESLICE_GRANULARITY to requeue.

                    */

                   if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -

                            p->time_slice) % TIMESLICE_GRANULARITY(p)) &&

                            (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&

                            (p->array == rq->active)) {

 

                            requeue_task(p, rq->active);

                            set_tsk_need_resched(p);

                   }

         }

out_unlock:

         spin_unlock(&rq->lock);

out:

         rebalance_tick(cpu, rq, NOT_IDLE);

}

 

再分析一下CFS的调度算法,同样关注一下scheduler_tick函数:

 

/*

 * This function gets called by the timer code, with HZ frequency.

 * We call it with interrupts disabled.

 *

 * It also gets called by the fork code, when changing the parent's

 * timeslices.

 */

void scheduler_tick(void)

{

         int cpu = smp_processor_id();

         struct rq *rq = cpu_rq(cpu);

         struct task_struct *curr = rq->curr;

         u64 next_tick = rq->tick_timestamp + TICK_NSEC;

 

         spin_lock(&rq->lock);

         __update_rq_clock(rq);

         /*

          * Let rq->clock advance by at least TICK_NSEC:

          */

         if (unlikely(rq->clock < next_tick)) {

                   rq->clock = next_tick;

                   rq->clock_underflows++;

         }

         rq->tick_timestamp = rq->clock;

         update_cpu_load(rq);

         curr->sched_class->task_tick(rq, curr, 0);

         update_sched_rt_period(rq);

         spin_unlock(&rq->lock);

 

#ifdef CONFIG_SMP

         rq->idle_at_tick = idle_cpu(cpu);

         trigger_load_balance(rq, cpu);

#endif

}

此函数里首先更新一下runqueue中的clock,然后调用curr->sched_class->task_tick(rq, curr, 0);cfs中此函数为

 

/*

 * scheduler tick hitting a task of our scheduling class:

 */

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

{

         struct cfs_rq *cfs_rq;

         struct sched_entity *se = &curr->se;

 

         for_each_sched_entity(se) {

                   cfs_rq = cfs_rq_of(se);

                   entity_tick(cfs_rq, se, queued);

         }

}

此函数重点在entity_tick(cfs_rq, se, queued);

static void

entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)

{

         /*

          * Update run-time statistics of the 'current'.

          */

         update_curr(cfs_rq);

 

#ifdef CONFIG_SCHED_HRTICK

         /*

          * queued ticks are scheduled to match the slice, so don't bother

          * validating it and just reschedule.

          */

         if (queued)

                   return resched_task(rq_of(cfs_rq)->curr);

         /*

          * don't let the period tick interfere with the hrtick preemption

          */

         if (!sched_feat(DOUBLE_TICK) &&

                            hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))

                   return;

#endif

 

         if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))

                   check_preempt_tick(cfs_rq, curr);

}

entity_tick中首先更新一下当前运行的进程的虚拟运行时间,然后调用check_preempt_tick来检查是否需要进行一次schedule的调度,

 

/*

 * Preempt the current task with a newly woken task if needed:

 */

static void

check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)

{

         unsigned long ideal_runtime, delta_exec;

 

         ideal_runtime = sched_slice(cfs_rq, curr);

         delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;

         if (delta_exec > ideal_runtime)

                   resched_task(rq_of(cfs_rq)->curr);

}

 

由此可见,CFS根本不管什么所谓的I/O bound进程或者是CPU bound进程,它只是简单的计算一下你应该运行的理想的时间片,如果你运行的时间已经大于理想运行的时间,ok,那就在下个点触发一次调度。对于不同优先级的进程,它的权重是不一样的:

 

static void set_load_weight(struct task_struct *p)

{

         if (task_has_rt_policy(p)) {

                   p->se.load.weight = prio_to_weight[0] * 2;

                   p->se.load.inv_weight = prio_to_wmult[0] >> 1;

                   return;

         }

 

         /*

          * SCHED_IDLE tasks get minimal weight:

          */

         if (p->policy == SCHED_IDLE) {

                   p->se.load.weight = WEIGHT_IDLEPRIO;

                   p->se.load.inv_weight = WMULT_IDLEPRIO;

                   return;

         }

 

         p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];//用来map权重的数组

         p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];

}

不同权重的进程,在一个targeted_latency中分得的时间片也不一样,优先级高的进程,权重就高,获得的ideal_time也多,这样在一个targeted_latency中运行的时间也多,这就是CFS的算法。另外需要注意的是,此处ideal_time的单位不再是tick,而是ns。也就是说,CFS度量一个进程是否需要被抢占已经完全脱离了tick的束缚,与tick无关。

最后,来总结下O1CFS的异同,以及他们各自的优缺点。首先,O1是通过按照进程的动态优先级来分配时间片,而且此时间片是以tick为单位,以这样一种方式,来达到偏爱(静态优先级高以及I/O bound)的进程,它在识别I/Obound的进程时使用了一系列复杂公式。而CFS是以进程的静态优先级作为它的权重,来分配CPUpowerCFS无需识别I/O bound的进程,因为这样的进程通常在睡眠,所以,它使用的CPUpower自然就少,当它被唤醒时,自然容易得到调度。

另外,由于O1调度算法将进程的优先级直接映射到时间片,这就导致一个问题,因为高优先级的进程得到的时间片多,这就使得两个同等优先级的进程处于刻运行状态时,反而高优先级的进程调度的间隔小于低优先级。比如,nice-10的进程时间片为100ticknice值为10的进程时间片为10tick,然而当两个nice值为-10的进程同时运行时,进程切换时间为100tick,而nice值为10的切换时间却为10tick。但是,nice值高的往往属于I/O bound的进程,本应该更短的切换。反观CFS,假设target_latency80tick,只要是nice值相同的两个进程,无论他们优先级是高还是低,都能分得target_latency/2的时间片,即50%CPU power

同样的,对于O1而言,在不同的nice值上加一或减一,得到的效果不同。例如,nice01分别为10095,两者相差并不多,而nice值为1819的分别为105,相差近一倍。而CFSnice值增加的部分对应的比例总是恒定的。

还有,O1分配时间片时以tick为单位,这样同样的nice值,在不同的设备上map的时间就可能不一样。而CFS始终以ns为单位,因此,不存在这样的问题。

最后,上面提到O1为了偏爱I/O bound的进程,如最开始所讲,会人为赋予这类进程更多的时间片。这其实在特定情况下,也会使I/O bound的进程占用太多本不该占用的时间。而CFS并不会多给任何进程时间片。

其实,上面所有的问题,都归根于设计理念的不同。O1实现的是进程的优先级对于时间片的绝对映射,而CFS实现的是优先级对于power的映射。因此,O1进程切换时间相对恒定而达不到进程间分享cpu的相对公平。而CFS先划定一个targeted_latency,在此基础上,分配各个进程的时间片(实际上是cpupower),这样虽然切换时间不恒定,但达到进程之间的相对公平。

总体上来看,CFS设计简洁,没有太多的启发式复杂的算法,而且避免了O1之前带来的很多问题。而且因为它的数据接口使用的是red-black tree,虽然比不上o1,但log(n)在性能上也够用。所以,CFS最终以它简单而健壮的架构,取代O1,成为新一代linux kernel的调度器。


阅读(3278) | 评论(0) | 转发(0) |
给主人留下些什么吧!~~