完全公平调度器(cfs)系列文章(二)-wangxingchao2010-ChinaUnix博客

wangxingchao2010

首页　| 　博文目录　| 　关于我

wangxingchao201

博客访问： 343636
博文数量： 102
博客积分： 2510
博客等级：少校
技术积分： 1146
用户组：普通用户
注册时间： 2010-01-21 22:33

文章分类

全部博文（102）

Borqs同事交流（0）
虚拟化（3）
云计算Cloud（2）
驱动模块（1）
生活（4）
职场（10）
业界（8）
OMS（1）
Android（16）
文件系统（16）
ULK笔记（5）
内核时间（6）
内核调度（28）
未分配的博文（2）

文章存档

2011年（8）

2010年（94）

我的朋友

embedded

相关博文

完全公平调度器(cfs)系列文章(二)

分类： LINUX

2010-01-24 21:48:35

6）静悄悄地过来，看看这个经常调用的函数，到底做了啥捏？
static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
          unsigned long delta_exec) 传进来的执行时间
{
    unsigned long delta_exec_weighted;

    schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));

    curr->sum_exec_runtime += delta_exec; 直接加到总运行时间变量里去，
    schedstat_add(cfs_rq, exec_clock, delta_exec);
    delta_exec_weighted = calc_delta_fair(delta_exec, curr); 这个很重要。。。这个函数名字叫”公平计算delta"，咋公平捏？第七步分析会看到
    curr->vruntime += delta_exec_weighted; 把上步计算出来的值加到了vruntime里
    update_min_vruntime(cfs_rq);
}

7) 看看怎么计算这个delta_exec_weighted的？
/*
* delta /= w
*/
static inline unsigned long
calc_delta_fair(unsigned long delta, struct sched_entity *se)
{
    if (unlikely(se->load.weight != NICE_0_LOAD)) 如果进程实体有默认的load值，直接返回delta
        delta = calc_delta_mine(delta, NICE_0_LOAD, &se->load);计算应该修正的这个值

    return delta;
}
在往下走之前，先弄明白nice 和 se->load之间的关系吧。
nice 都知道，se->load是指调度实体的负载。同理一个cfs_rq也有负载。。一般是所有task的load之和。这个load是通过nice与一个静态数组转换来的。一般普通进程的nice值在-20~19之间，其对应load.weight值是递减的，具体可参见那个静态数组。而 NICE_0_LOAD就是nice=0的对应的load值。
好了，这相当于是做一个修正喽。假如现在传进来的运行时间是10,那么，这个调度实体应该返回的delta是多少呢？看这个函数的comment就明白了，delta * NICE_0_LOAD/se->load，是一个比重。。。之前这个地方说过了，呵呵。

那么，调度实体的优先级越高，其得出来的值越小。
反回到6)中，可以知道vruntime是怎么update的了。那么，它是如何初始化的呢？后面有，再说,是通过vslice()计算的。
8）我们再返回到task_new_fair,看update_curr()后，接下来做的事情   place_entity.

这个函数也比较好玩，做得是一个”奖励“工作：

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
    u64 vruntime = cfs_rq->min_vruntime;

    /*
     * The 'current' period is already promised to the current tasks,
     * however the extra weight of the new task will slow them down a
     * little, place the new task so that it fits in the slot that
     * stays open at the end.
     */
    if (initial && sched_feat(START_DEBIT))
        vruntime += sched_vslice(cfs_rq, se); 稍微加一些，呵呵

    if (!initial) {    如果是睡了，唤醒的，应该有些补偿的   具体怎么补，多了怎么办，少了怎么办？
        /* sleeps upto a single latency don't count. */
        if (sched_feat(NEW_FAIR_SLEEPERS)) {
            unsigned long thresh = sysctl_sched_latency;

            /*
             * convert the sleeper threshold into virtual time
             */
            if (sched_feat(NORMALIZED_SLEEPER))
                thresh = calc_delta_fair(thresh, se);

            vruntime -= thresh;
        }

        /* ensure we never gain time by being placed backwards. */
        vruntime = max_vruntime(se->vruntime, vruntime);
    }

    se->vruntime = vruntime;
}

9）最后，task_new_fair做的事情就是让task入列。enqueue_task_fair()
9) static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &p->se;

    for_each_sched_entity(se) {遍历所有的se及它的父亲。。。这个在非组调度时，就是se,组调度时，会将其dad一同入列
        if (se->on_rq)如果已经在运行队列，就停止
            break;
        cfs_rq = cfs_rq_of(se);
        enqueue_entity(cfs_rq, se, wakeup);真正的入队函数，，，把se插入到rb-tree.
        wakeup = 1;
    }

    hrtick_update(rq);   hr部分的，后面再分析。
}

10)enqueue_entity() 会更新时间，最终调────enqueue_entity()，把schedule_entity往红黑树中存放。每个结点的值是通过 entity_key()来实现的，比较奇怪，是se->vruntime-cfs->min_vruntime....不知道这么做。

入列操作有一些跟唤醒判断有关的，后面再查资料看看
11）再回到wake_up_new_task()来。
会调用check_preempt_curr()，判断当前进程是否被新进程抢占。这个函数调用sched_fair_class里的 check_preempt_wakeup,这个函数也很好玩儿，它会调其中一个函数，wakeup_preempt_entity()，来判断当前的进程和新进程之前满足什么样的关系才可以抢占。

这时新加了个区间：新进程的vruntime比current小，但要满足一定条件，才能完成抢占。
三种情况返回值为-1/0/1,代表不同的意思。

最后我们又返回到了wake_up_new_task.一个新进程创建后被调度的过程大体就是上面的流程。

另一个比较有意思的就是tick了，明天看看tick中断做的事情吧。

一些有意思的关于cfs调度的资料：
   视频 cfs作者的演讲
Some kernel hackers...
Thomas Gleixner

Schedulers: the plot thickens

[patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]

http://www.ibm.com/developerworks/cn/linux/l-cfs/index.html 完全公平调度程序

鼠眼看Linux调度器
http://blog.chinaunix.net/u1/42957/showart.php?id=337604

http://www.ibm.com/developerwork ... cheduler/index.html
Linux 调度器发展简述

至今不敢写一篇关于cfs的文章收藏
http://blog.csdn.net/dog250/archive/2009/01/15/3793000.aspx

http://www.ibm.com/developerworks/cn/linux/l-cfs/index.html
使用完全公平调度程序（CFS）进行多任务处理

3 关于tick中断
为了保证调度，在每个tick都会对进程时间信息等进行更新。首先找一个从hrtimer上层到进程调度之间的点，暂时往进程调度的方向说，后面谈到hrtimer时，再往hrtimer追根溯源吧。
这个点就是，update_process_times,它是在timer interrupt 中被调的，它会调一个跟process直接相关的函数,scheduler_tick().
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
* It also gets called by the fork code, when changing the parent's
* timeslices.
*/
void scheduler_tick(void)
{
    int cpu = smp_processor_id();
    ...
    update_cpu_load(rq);
    curr->sched_class->task_tick(rq, curr, 0);   task_tick即公平调度类里的
    spin_unlock(&rq->lock);

#ifdef CONFIG_SMP
    rq->idle_at_tick = idle_cpu(cpu);
    trigger_load_balance(rq, cpu);
#endif
}

看看task_tick：
/*
* scheduler tick hitting a task of our scheduling class:
*/
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;

    for_each_sched_entity(se) {   老朋友了，组调度的时候会遍历父亲们，否则就是它自己
        cfs_rq = cfs_rq_of(se);
        entity_tick(cfs_rq, se, queued); 传进来的queued=0
    }
}

继续看entity_tick;

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    /*
    * Update run-time statistics of the 'current'.
    */
    update_curr(cfs_rq);   更新当前进程current的信息，这个昨天已经看过了

#ifdef CONFIG_SCHED_HRTICK
    /*
    * queued ticks are scheduled to match the slice, so don't bother
    * validating it and just reschedule.
    */
    if (queued) {
        resched_task(rq_of(cfs_rq)->curr);
        return;
    }
    /*
    * don't let the period tick interfere with the hrtick preemption
    */
    if (!sched_feat(DOUBLE_TICK) &&
            hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
        return;
#endif

    if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
        check_preempt_tick(cfs_rq, curr); 这个要与check_preempt_curr对比着看，我当初就在这个地方晕了。。。注释是这样的：“ Preempt the current task with a newly woken task if needed:”

}

把两个函数都贴出来：

1） static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
{
    struct task_struct *curr = rq->curr;
    struct sched_entity *se = &curr->se, *pse = &p->se;
。。。
    if (wakeup_preempt_entity(se, pse) == 1) { 这个是根据当前进程curr和要调度的p之前的vruntime比较，有一个区间限度。
。。。
}
2）static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;

    ideal_runtime = sched_slice(cfs_rq, curr);算一下当前进程理想的运行时间是多少？
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime; 算算它已经运行了多久？
    if (delta_exec > ideal_runtime)
        resched_task(rq_of(cfs_rq)->curr);
}

有趣的是，上面两个函数的comment是一样的：-）

第二个函数折磨了我好久~ 既然是完全公平调度，每个进程的运行时间应该完全公平呀？实际上ideal_runtime 是与进程的优先级有关系的。
首先看ideal_runtime是怎么计算出来的。
3)
/*
* We calculate the wall-time slice from the period by taking a part
* proportional to the weight.
*
* s = p*P[w/rw]这是通过一个比例算出的
*/
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    unsigned long nr_running = cfs_rq->nr_running;运行的进程数目

    if (unlikely(!se->on_rq))
        nr_running++;

    return calc_delta_weight(__sched_period(nr_running), se);   这个weight是跟进程数目有关系的
}

4)看看cale_delata_weight第一个参数是怎么得到的。
/*
* The idea is to set a period in which each task runs once.   设置一个period,让每个进程都跑一遍。
*
* When there are too many tasks (sysctl_sched_nr_latency) we have to stretch
* this period because otherwise the slices get too small. 当进程数目大于默认值（5）时，要拉伸一下这个period
*
* p = (nr <= nl) ? l : l*nr/nl
*/
static u64 __sched_period(unsigned long nr_running)
{
    u64 period = sysctl_sched_latency;
    unsigned long nr_latency = sched_nr_latency;

    if (unlikely(nr_running > nr_latency)) {
        period = sysctl_sched_min_granularity; 最小粒度   4ms
        period *= nr_running; 4ms * 进程数目
    }

    return period;
}

这个比较好理解，小于5个进程时，这个period周期是20ms,大于5个running number 时，就让4*nr;后面就是线性的，

回到3,计算出来的这个period作为calc_delata_weight（）的第一个参数往下传，继而算出当前se的比重来。
5）
/*
* delta *= P[w / rw]
*/
static inline unsigned long
calc_delta_weight(unsigned long delta, struct sched_entity *se)
{
    for_each_sched_entity(se) {
        delta = calc_delta_mine(delta,
                se->load.weight, &cfs_rq_of(se)->load); 这个函数不陌生，在计算vruntime的时候也有它，不过那个函数是calc_delta_fair,重点是中间的参数，一个是 nice_0_load,一个是当前进程的load
    }

    return delta;
}
这个很简单：   假如n 个进程的period是4n,其中进程i的 load.weight是m,所有进程的load是M，那么进程i应该在这个period里面分到的蛋糕是： 4n * m/M

再返回check_preempt_tick，通过比较实际计算出来的值与理想值比较，确定当前进程是否被抢占。。

这个地方有个疑问： check_preempt_tick是在tick中断里调的，check_preempt_curr()昨晚分析过在进程创建的时候有过调用，后都传进去一个需要调度的进程作为参数，与当前进程比较，满足一定的条件时，才会把当前进程切换掉。前者是在tick中断里掉，不知道下一个调的进程是谁，这个应该由调度器后面去选一个。没关系，它的工作就是看看当前进程把自己的时间片用完了没？用完了的话，我就回去告诉调度器，把你调度走，，，具体怎么调度呢？下面我们接着看吧。。。先猜想一下：1 ）把当前进程出列，调整其在红黑树中的位置。2）从红黑树中选择下最左结点运行

4 调度函数
resched_task（）里设置某个进程需要调度。其实就是设定一个flag而已。
static inline void set_tsk_need_resched(struct task_struct *tsk)
{
    set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}

来看看大boss,
/*
* schedule() is the main scheduler function.
*/
asmlinkage void __sched schedule(void)
{
...

    prev->sched_class->put_prev_task(rq, prev);
    next = pick_next_task(rq, prev);

....
}

中间这两个函数，是我们关心的。

1)
/*
* Account for a descheduled task:
*/
static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
    struct sched_entity *se = &prev->se;
    struct cfs_rq *cfs_rq;

    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        put_prev_entity(cfs_rq, se);
    }
}

这是对一个即将被调度出去的进程的处理。
2)    它会掉pub_prev_entity 这个例程看得有点儿迷糊。如果进程还在runqueue上，就把它重新放回红黑树，只是位置变了,并把当前的curr指针清空。
static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
    /*
    * If still on the runqueue then deactivate_task()
    * was not called and update_curr() has to be done:
    */
    if (prev->on_rq) 这个条件还得研究研究，什么情况下用与不用
        update_curr(cfs_rq);

    check_spread(cfs_rq, prev);
    if (prev->on_rq) {
        update_stats_wait_start(cfs_rq, prev);
        /* Put 'current' back into the tree. */
        __enqueue_entity(cfs_rq, prev);   怎么又放回去呢？
    }
    cfs_rq->curr = NULL;
}

3) 回到scheduler函数，
static struct task_struct *pick_next_task_fair(struct rq *rq)
{
    struct task_struct *p;
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se;

    if (unlikely(!cfs_rq->nr_running))
        return NULL;

    do {
        se = pick_next_entity(cfs_rq); 挑出下一个可以运行的
        set_next_entity(cfs_rq, se); 设置它为cfs_rq的curr
        cfs_rq = group_cfs_rq(se);是group的 se么？
    } while (cfs_rq);   这个循环其实很有意思，它为group调度提供支持。如果cfs_rq这层里面的进程被选择完毕，它会接着选择其父task_group里的去运行。

    p = task_of(se);
    hrtick_start_fair(rq, p);

    return p;
}

4) pick_next_entity会调用__pick_next_entity(cfs_rq)去选择红黑树最左侧的结点，然后有两个条件判断，作为返回se的最终人选。
"
    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1)
        return cfs_rq->next;

    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1)
        return cfs_rq->last;
"
wakeup_preempt_entity（）这个我在前面分析过了，这是最终决定两个进程能否进行切换的一个判断。。。next 和 last估计是cfs里定时更新的亲戚，好事当然先轮着自己喽。。所以这个地方调度会优先选择下，，，具体再讨论吧。

5) 选出来了以后，会通过 set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)设置一些信息，把它贼给cfs_rq->curr,更新总运行时间等。
6）这儿还有一个奇怪的函数：
hrtick_start_fair(rq, p);
我认为选出了可以运行的进程，下一步应该真正让它运行吧，，这步操作在哪呢？
上面这个函数只是做了些判断，是否当前进程真得需要运行？然后更新了些与hrtick相关的rq信息。
也有可能是scheduler(）函数下面做的吧，具体我还没有打到。

这儿有个小猜想，希望得到验证：
1) a task call schedule(),it's se->on_rq = 1,so it need denqueue .call deactivate_task()

2) a task picked by schedule() need to get the cpu to run,it's se->on_rq = 0,so
it need enqueue.
3) a task's exec time is over,need change to rb-tree's right it's se->on_rq = 1,just change the tree

这个再想想。。。。

阅读(1242) | 评论(0) | 转发(0) |

下一篇：完全公平调度器(cfs)系列文章(三)

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6