linux调度器（三）- fair class入门-mournjust-ChinaUnix博客

mournjustmournjust.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

mournjust

博客访问： 762472
博文数量： 79
博客积分： 2671
博客等级：少校
技术积分： 1247
用户组：普通用户
注册时间： 2010-04-02 15:26

个人简介

宅男

文章分类

全部博文（79）

scheduler（7）
perl（3）
Android（10）
linux编译（3）
文件系统（23）
linux TCP/IP（0）
调试心得（4）
sep0718（3）
数据结构、编程（6）
杂谈，心情（6）
linux内核源码（14）
未分配的博文（0）

文章存档

2017年（11）

2016年（12）

2015年（6）

2012年（10）

2011年（33）

2010年（7）

我的朋友

相关博文

linux调度器（三）- fair class入门

分类： Android平台

2017-07-07 12:32:50

在看这一篇之前先看一下sched_whitepapersched_whitepaper.docx中的普通进程的模型。关于vruntime以及weight。

这两点对于fair进程至关重要。

CFS（completely fair scheduler）并不是完全公平的，而且也不可能是完全公平的，因为进程会更加其重要性而有优先级之分。

在pick_next_entity中，仅仅是从每一个cfs rq中去选择vruntime最小的进程。所以vruntime成为fair class进程也就是普通进程的唯一的标识符。

下面就从pick_next_entity函数进入cfs吧。

static struct sched_entity *

pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)

{

struct sched_entity *left = __pick_first_entity(cfs_rq);

这个函数看起来似乎有点的复杂，但是最主要的其实就是第一句。

Linux将所有的进程以vruntime为key，以红黑树的形式组织在cfs rq中。显然__pick_first_entity就是去获取vruntime最小的entity。这边顺便提一句，由于cgroup的存在。Pick_next_entity获得entity并不一定是一个实际的task，而有可能是一个cgroup。

所以在pick_next_task_fair中是一个do while的形式。

do {
                struct sched_entity *curr = cfs_rq->curr;

                /*
                 * Since we got here without doing put_prev_entity() we also
                 * have to consider cfs_rq->curr. If it is still a runnable
                 * entity, update_curr() will update its vruntime, otherwise
                 * forget we've ever seen it.
                 */
                if (curr) {
                        if (curr->on_rq)
                                update_curr(cfs_rq);
                        else
                                curr = NULL;

                        /*
                         * This call to check_cfs_rq_runtime() will do the
                         * throttle and dequeue its entity in the parent(s).
                         * Therefore the 'simple' nr_running test will indeed
                         * be correct.
                         */
                        if (unlikely(check_cfs_rq_runtime(cfs_rq)))
                                goto simple;
                }

                se = pick_next_entity(cfs_rq, curr);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);

属于同一个cgroup的task，都以类似的形式组织在se->my_q中。My_q也是一个cfs rq。用来表示属于一个cgroup的不同task。这些task也跟cpu runqueue上的进程类似，在cgroup内部竞争CPU的使用权。

有点远了，再次回到Pick_next_entity函数，除了上面提到的第一句代码之外，其余的代码都用于处理特殊情况。即set_next_entity/set_skip_buddy/set_last_buddy，这部分后面再说。

上面说道了CFS在vruntime层面上是公平的，weight影响着task的实际运行时间。

实时进程的优先级从0到99，普通进程的优先级从100开始到140.为了计算简单，linux维护了一个数组用来表示不同优先级的进程对于的weight值。

static const int prio_to_weight[40] = {
/* -20 */ 88761, 71755, 56483, 46273, 36291,
/* -15 */ 29154, 23254, 18705, 14949, 11916,
/* -10 */ 9548, 7620, 6100, 4904, 3906,
/* -5 */ 3121, 2501, 1991, 1586, 1277,
/* 0 */ 1024, 820, 655, 526, 423,
/* 5 */ 335, 272, 215, 172, 137,
/* 10 */ 110, 87, 70, 56, 45,
/* 15 */ 36, 29, 23, 18, 15,
}

同时维护了另外一个数组，里面是2^32/x的预先计算的值。

static const u32 prio_to_wmult[40] = {
/* -20 */ 48388, 59856, 76040, 92818, 118348,
/* -15 */ 147320, 184698, 229616, 287308, 360437,
/* -10 */ 449829, 563644, 704093, 875809, 1099582,
/* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326,
/* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587,
/* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126,
/* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,
/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

其实就weight以及1/weight的预先算好的值，用于后续的计算。但是由于linux kernel对于浮点的支持不好，
所以这边的1/weight用2^32/weight来代替（__update_inv_weight函数）。

我们看一下update_curr中一段关于vruntime的计算。

static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
u64 now = rq_clock_task(rq_of(cfs_rq));
u64 delta_exec;
if (unlikely(!curr))
return;
delta_exec = now - curr->exec_start;
if (unlikely((s64)delta_exec <= 0))
return;
curr->exec_start = now;
schedstat_set(curr->statistics.exec_max,
max(delta_exec, curr->statistics.exec_max));
curr->sum_exec_runtime += delta_exec;
schedstat_add(cfs_rq, exec_clock, delta_exec);
curr->vruntime += calc_delta_fair(delta_exec, curr);
update_min_vruntime(cfs_rq);

比较关心的是其中使用calc_delta_fair函数去根据当前task的weight值计算vruntime。

/*
* delta /= w
*/
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
if (unlikely(se->load.weight != NICE_0_LOAD))
delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
return delta;
}

其最终结果为delta*1024/weight.

优先级高的进程其weight值大（参考prio_to_weight数组），所以其vruntime相对就增加的比较慢，实际运行时间就相对长一些。

接着回到关键函数pick_next_task_fair。

if (prev != p) {
struct sched_entity *pse = &prev->se;
while (!(cfs_rq = is_same_group(se, pse))) {
int se_depth = se->depth;
int pse_depth = pse->depth;
if (se_depth <= pse_depth) {
put_prev_entity(cfs_rq_of(pse), pse);
pse = parent_entity(pse);
}
if (se_depth >= pse_depth) {
set_next_entity(cfs_rq_of(se), se);
se = parent_entity(se);
}
}
put_prev_entity(cfs_rq, pse);
set_next_entity(cfs_rq, se);
}

上面这段代码看起来有点费解，还是关于cgroup的。上面讲到了一个cgroup有一个对于的sched_entity结构，其中成员my_q维护这这个cgroup里面的task。其原理跟task竞争CPU的使用权一样，cgourp内部的task竞争代表这个cgroup运行的权利，就像CPU上每次只能运行一个task一样，cgroup内部每次只能选择一个task来运行。同样的原理，选择vruntime最小的进程。

下面继续select_task_rq_fair，用于给一个新创建的task或者新wakeup的task选择新的CPU。其中prev_cpu为进程p上一次运行的cpu。

static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
if (sd_flag & SD_BALANCE_WAKE)
want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
break;
/*
* If both cpu and prev_cpu are part of this domain,
* cpu is a valid SD_WAKE_AFFINE target.
*/
if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
affine_sd = tmp;
break;
}

在这里面涉及到两个新的结构体struct sched_domain,sched_group.这是调度器相关的两个结构体。下面是我花的一个典型的8核ARM B-L架构的sched_domain/sched_group的相关示意图。

由于这颗SOC有两个cluster，每个cluster有4个CPU。所以sched_domain分为两个level，第一个level有一个sched_domain的mask为0XFF，这个sched_domain一共有两个sched_group，分别代表两个cluster，两个sched_group的cou mask分别为0X0F跟0XF0.

第二个level有两个sched_domain，他们的cpu mask分别为0xF0,0x0F.每个sched_domain下面有4个sched_group。第一个sched_domain的4个sched_group的cpu mask分别为0x1,0x2,0x4,0x8.分别用来表示第一个cluster上的4个cpu。第二个sched_domain的4个sched_group的cpu mask分别为0x10，0x20，0x40，0x80，用来表示第二个cluster上的4个CPU。如上图所示。

这上面有一个sync = wake_flags & WF_SYNC的判断，其中WF_SYNC用于表示是否需要sync wakeup。什么叫sync ？就是进程A唤醒进程B，然后进程A就进入休眠状态。由于sync的特殊性，所以在满足一定的条件下，倾向于将进程B放在原来进程A运行的CPU上。

这些sched_domian有从属关系，显然cpu mask为0x0F的domain从属于0xFF的domain。

他们之间用parent，child指针来表示这种关系。

除去wake affine相关的代码之外，其余核心代码如下。

} else while (sd) {
struct sched_group *group;
int weight;
if (!(sd->flags & sd_flag)) {
sd = sd->child;
continue;
}
group = find_idlest_group(sd, p, cpu, sd_flag);
if (!group) {
sd = sd->child;
continue;
}
new_cpu = find_idlest_cpu(group, p, cpu);
if (new_cpu == -1 || new_cpu == cpu) {
/* Now try balancing at a lower domain level of cpu */
sd = sd->child;
continue;
}
/* Now try balancing at a lower domain level of new_cpu */
cpu = new_cpu;
weight = sd->span_weight;
sd = NULL;
for_each_domain(cpu, tmp) {
if (weight <= tmp->span_weight)
break;
if (tmp->flags & sd_flag)
sd = tmp;
}

上面的代码用于从sched domain中找出一个idlest的group，然后从这个group中找出idlest的CPU。这基本是基于load balance的考虑，即保持每一个cpu上的负载均衡，从而最大化吞吐。

至于find_idlest_group以及find_idlest_cpu在这里不准备展开，既然涉及到查找idlest cpu，也就是负载最小的CPU。那么评判的标准是什么？如何定义一个CPU是负载最小的呢？

这就要从上面从find idlest cpu中用到的函数weighted_cpuload切入。

/* Used instead of source_load when we know the type == 0 */

static unsigned long weighted_cpuload(const int cpu)

{

return cfs_rq_runnable_load_avg(&cpu_rq(cpu)->cfs);

}

static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)

{

return cfs_rq->runnable_load_avg;

}

这里面涉及到了cfs runqueue的一个变量叫着runnable_load_avg.同时跟task对于的sched_avg中间还有一个成员叫load_avg。用来表示一个task的负载。

我们知道每一个task_struct有一个对于的sched_entity。但是一个sched entity并不一定对应这一个task，也有可能是一个group。

提一下，不管是cfs rq，还是rt rq，dl rq。在rq中的都是出于runnable的等待运行的task。

一旦一个task进入睡眠状态，那么就从队列中删除，这就是dequeue的作用。同样，一个进程被唤醒的话，就会被插入到队列中，也就是enqueue的作用。关于如何计算每一个进程的load，linux CFS有一套模型，也就是PELT（per entity load tracking）.

阅读(2259) | 评论(0) | 转发(0) |

上一篇：linux调度器（二）- sched_class函数介绍

下一篇：linux调度器（四）- PELT（per entity load tracking）

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6