linux进程管理之调度与切换-danforn-ChinaUnix博客

二元空间bluce.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

danforn

博客访问： 594963
博文数量： 146
博客积分： 5251
博客等级：大校
技术积分： 1767
用户组：普通用户
注册时间： 2006-11-10 15:58

文章分类

全部博文（146）

嵌入式开发（4）
Linux开发（8）
Linux Shell（2）
C C++（5）
Unix Linux资源库（2）
Linux System Adm（2）
Linux Distributi（2）

Gentoo（1）

Ubuntu（0）

SUSE（1）

Fedora Core（0）
Linux Network（68）

Network Protocol（13）

Network Security（2）

Network Manageme（11）

Network Developm（38）
Linux Device Dri（13）
Linux Kernel（39）
未分配的博文（1）

文章存档

2010年（12）

2008年（129）

2007年（5）

我的朋友

相关博文

linux进程管理之调度与切换

分类： LINUX

2008-11-11 16:58:02

进程的调度与切换直接影响着进程子系统的执行效率.Linux摒弃了i386 硬件提供的进程切换方法.手动保存进程上下文.在调度策略上,近几个版本对其都有很大的改动.特别是在2.6.23版本与以前发布的2.6.0更是相差甚远.在调度方面.我们以2.6.9在代码作为基准作为分析.

一:进程切换

进程的切换过程是在context_switch()中实现的.从它的代码说起:

static inline void

context_switch(struct rq *rq, struct task_struct *prev,

struct task_struct *next)

{

struct mm_struct *mm, *oldmm;

prepare_task_switch(rq, prev, next);

mm = next->mm;

oldmm = prev->active_mm;

* For paravirt, this is coupled with an exit in switch_to to

* combine the page table reload and the switch backend into

* one hypercall.

arch_enter_lazy_cpu_mode();

//task->mm 为空.则是一个内核线程

if (unlikely(!mm)) {

//内核线程共享上一个运行进程的mm

next->active_mm = oldmm;

//增加引用计数

atomic_inc(&oldmm->mm_count);

enter_lazy_tlb(oldmm, next);

} else

//如果是用户进程,则切换运行空间

oldmm, mm, next);

//如果上一个运行进程是内核线程

if (unlikely(!prev->mm)) {

//赋active_mm为空.

prev->active_mm = NULL;

//更新运行队列的prev_mm成员

rq->prev_mm = oldmm;

}

* Since the runqueue lock will be released by the next

* task (which is an invalid locking op but in the case

* of the scheduler it's an obvious special-case), so we

* do an early lockdep release here:

#ifndef __ARCH_WANT_UNLOCKED_CTXSW

spin_release(&rq->lock.dep_map, 1, _THIS_IP_);

#endif

/* Here we just switch the register state and the stack. */

//切换进程的执行环境

switch_to(prev, next, prev);

barrier();

* this_rq must be evaluated again because prev may have moved

* CPUs since it called schedule(), thus the 'rq' on its stack

* frame will be invalid.

//进程切换之后的处理工作

finish_task_switch(this_rq(), prev);

}

实际上，进程切换包括进程的执行环境切换和运行空间的切换。运行空间的切换是由switch_mm()完成的。代码如下：

//切换进程的执行空间

static inline void switch_mm(struct mm_struct *prev,

struct mm_struct *next,

struct task_struct *tsk)

{

//得到当前进程运行的cpu

int cpu = smp_processor_id();

//如果要要切换的prev!=next 执行切换过程

if (likely(prev != next)) {

/* stop flush ipis for the previous mm */

//清除prev的cpu_vm_mask标志.表示prev已经弃用了当前CPU

cpu_clear(cpu, prev->cpu_vm_mask);

#ifdef CONFIG_SMP

//在SMP系统中.更新cpu_tlbstate

per_cpu(cpu_tlbstate, cpu).state = TLBSTATE_OK;

per_cpu(cpu_tlbstate, cpu).active_mm = next;

#endif

//设置cpu_vm_mask.表示next占用了当前CPU

cpu_set(cpu, next->cpu_vm_mask);

/* Re-load page tables */

//加载CR3

load_cr3(next->pgd);

* load the LDT, if the LDT is different:

//如果LDT不相同,还要加载LDT

if (unlikely(prev->context.ldt != next->context.ldt))

load_LDT_nolock(&next->context);

}

#ifdef CONFIG_SMP

else {

per_cpu(cpu_tlbstate, cpu).state = TLBSTATE_OK;

//prev == next .那当前cpu中的active_mm就是prev. 也即next

BUG_ON(per_cpu(cpu_tlbstate, cpu).active_mm != next);

//在SMP系统中.虽然MM是一样的,但需要加载CR3

//执行cpu_test_and_set()来判断next是否正运行在此CPU上,这里是判断在切换之前next

//是否运行在当前的cpu中

//假设当前cpu为1..一个进程在1上执行的时候,被调度出来.再次调度的时候

//又发生在cpu 1

if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) {

/* We were in lazy tlb mode and leave_mm disabled

* tlb flush IPI delivery. We must reload %cr3.

load_cr3(next->pgd);

load_LDT_nolock(&next->context);

}

#endif

}

我们在上面的代码分析中可以看到。它主要是切换了进程的CR3。

执行环境的切换是在switch_to()中完成的。它的代码如下：

#define switch_to(prev,next,last) do { \

unsigned long esi,edi; \

asm volatile("pushfl\n\t" /* Save flags */ \

"pushl %%ebp\n\t" \

"movl %%esp,%0\n\t" /* save ESP */ \

"movl %5,%%esp\n\t" /* restore ESP */ \

"movl $1f,%1\n\t" /* save EIP */ \

"pushl %6\n\t" /* restore EIP */ \

"jmp __switch_to\n" \

"1:\t" \

"popl %%ebp\n\t" \

"popfl" \

:"=m" (prev->thread.esp),"=m" (prev->thread.eip), \

"=a" (last),"=S" (esi),"=D" (edi) \

:"m" (next->thread.esp),"m" (next->thread.eip), \

"2" (prev), "d" (next)); \

} while (0)

这段代码是用AT&T嵌入式汇编写的。如果对相关语法不熟悉，请查阅相关资料。它的运行流程如下：

它总共有9个参数。程序开始时，先把这9个参数压栈。

后面一个冒号跟的是输入部。它表示程序开始运时的一系列初始化。初始化如下：

Prev à eax

Next à edx

它的执行流程如下：

Flag寄存器入栈(pushfl)

Ebp入栈(pushl %%ebp.在C中。Ebp寄存器用来存放当前的运行函数的帧指针)

Esp à prev->tread.esp(movel &&esp, %0 保存当前寄存器指针到prev->tread.esp)

Next->tread.esp à esp(将next->tread.esp加载esp寄存器.这一步实际上是加载next进程的内核栈。运行到这里。理论上prev已经被切换到next了。因为因此如果用curret获得当前进程应该是next)

将标号为1的地址 à prev->thread.eip (movl $1f,%1.虽然理论是进程是切换完成了，但是需要保存prev的执行环境。这里是指明prev进程的下一条运行指令。这样如果切换进prev进程的话，就会从标号1的地址开始运行)

Next->thread.eip压栈(pushl %6)

跳转到__switch_to()函数(jmp __switch_to)

这里要注意的是__switch_to函数是经过regparm(3)来修饰的。这个是GCC的一个扩展语法。即从eax.ebx.ecx寄存器取函数的参数。这样：__switch_to()函数的参数指是从寄存器里取的。并不是像普通函数那样从当前堆栈中取得.在调用函数__switch_to()之前。将next->thread.eip压栈了。这样从函数返回之后，它的下一条运行指令就是next->thread.eip了

我们可以想象一下：

对于新创建的进程。我们设置了 p->thread.eip = (unsigned long) ret_from_fork。这样子进程被切换进来之后，就会通过ret_from_fork返回到用户空间了。(详情请参照本站的相关文档)

对于已经运行的进程。我们在这里可以看到，在进程被切换出去的时候，prev->thread.eip会被设置成标号1的地址。即是从标号1的地址开始运行的。

标号1的操作：

恢复ebp (popl %%ebp)

恢复flags (popfl)

这样就恢复了进程的执行环境

这段代码有很多让人疑惑的地方：

在进程切换的时候，只是显示保存了flags esp和 ebp寄存。显示的用到了eax.edx.那其它寄存器怎么保存的呢？

实际上，过程切换只是发生在内核态中。对于内核态中的寄存器来说。它们的段寄存器都是一样的。所以不需要保存。

另外：对于通用寄存器。Esi edi ecx怎么保存的呢？我查阅了相关资料。没有发现能完全解释清楚的。现将我各人意见所列如下：

Esi.edi寄存器：有人认为在输出部中有这两句："=S" (esi),"=D" (edi) 认为会将esi edi压栈。这际上这是错误的。压栈的只是esi edi变量(在嵌入代码之前定义的).而不是esi edi寄存器。在这里。它只是在嵌入代码运行之后。将esi edi寄存器的内容分别放到esi edi变量中而已。在我用gcc –S 测试结果看来也确实如此。

那即然这样。Esi edi ecx寄存器怎么被保存。以便将来在prev运行的时候被恢复呢？

我认为：esi edi ecx没有被保存。也不用保存。

我们来看一下切换的整个过程：

在switch_to()中执行环境的切换过程。在prev被切换回来之后。是从switch_to中标号1的地址继续执行的。这时候已经被切换回了进程prev的内核栈。假设此时没有保存其它的寄存器(flags ,ebp是必须要保存的哦 ^_^ ).它的后续操作是barrier()(GCC的一段嵌入代码。声明内存已经变动.它实际上没有任何操作，这跟寄存器没关系).调用函数finish_task_switch().再从context_switch()函数中返回。再从schedule()中返回。再继续执行prev进程流。

我们知道，C函数的调用机制中。在调用之前先会执行进程环境的保存，再推入参数。最后推入下一条执令地址。然后跳转到函数地址。从函数中返回之后。会将环境恢复。继续执行。

这样我们知道。在switch_to()后的finish_task_switch()的函数调用是不受寄存器影响的(因为这是个函数。在进入函数后相当是一个全新的“世界”).然后从context_switch()返回之后就会恢复schedule()中的执行环境了。到了schedule()之后，这时候的环境跟之前切换出去的环境是完全一样的了.

当然，这样做的前提时：在switch_to()之后的代码对寄存器没有依赖。也就是全为函数。假设在switch_to()之后，有对ecx .edi esi寄存器的读取操作就会产生错误了。

在switch_to中会调用__switch_to()。它的代码如下所示：

struct task_struct fastcall * __switch_to(struct task_struct *prev_p, struct task_struct *next_p)

{

struct thread_struct *prev = &prev_p->thread,

*next = &next_p->thread;

int cpu = smp_processor_id();

struct tss_struct *tss = &per_cpu(init_tss, cpu);

/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */

//MMX,FPU,XMM寄存器的相关操作

//我们在前面已经详细分析过这个函数。详细请参阅本站相关文档

__unlazy_fpu(prev_p);

/* we're going to use this soon, after a few expensive things */

if (next_p->fpu_counter > 5)

prefetch(&next->i387.fxsave);

* Reload esp0.

//将next->thread.esp0 --> tss

//从用户态切换到内核态的时候,将此值加载esp寄存器.即为进程的内核栈顶位置

load_esp0(tss, next);

* Save away %gs. No need to save %fs, as it was saved on the

* stack on entry. No need to save %es and %ds, as those are

* always kernel segments while inside the kernel. Doing this

* before setting the new TLS descriptors avoids the situation

* where we temporarily have non-reloadable segments in %fs

* and %gs. This could be an issue if the NMI handler ever

* used %fs or %gs (it does not today), or if the kernel is

* running inside of a hypervisor layer.

//为 prev保存gs

savesegment(gs, prev->gs);

* Load the per-thread Thread-Local Storage descriptor.

//从next的tls_array 缓存中加载线程的Thread-Local Storage描述符

load_TLS(next, cpu);

* Restore IOPL if needed. In normal use, the flags restore

* in the switch assembly will handle this. But if the kernel

* is running virtualized at a non-zero CPL, the popf will

* not restore flags, so it must be done in a separate step.

//如果当前特权级别是0并且prev->iopl != next->iopl则恢复IOPL设置set_iopl_mask(next->iopl)。

if (get_kernel_rpl() && unlikely(prev->iopl != next->iopl))

set_iopl_mask(next->iopl);

* Now maybe handle debug registers and/or IO bitmaps

// 根据thread_info的TIF标志_TIF_WORK_CTXSW和TIF_IO_BITMAP判断是否需要处理debug寄存器和IO位图

if (unlikely(task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV ||

task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT))

__switch_to_xtra(prev_p, next_p, tss);

* Leave lazy mode, flushing any hypercalls made here.

* This must be done before restoring TLS segments so

* the GDT and LDT are properly updated, and must be

* done before math_state_restore, so the TS bit is up

* to date.

//arch_leave_lazy_cpu_mode()设置CPU的lazy模式

arch_leave_lazy_cpu_mode();

/* If the task has used fpu the last 5 timeslices, just do a full

* restore of the math state immediately to avoid the trap; the

* chances of needing FPU soon are obviously high now

// 如果next_p->fpu_counter > 5则恢复next_p的FPU寄存器内容

if (next_p->fpu_counter > 5)

math_state_restore();

* Restore %gs if needed (which is common)

//如果需要，恢复gs寄存器

if (prev->gs | next->gs)

loadsegment(gs, next->gs);

//把当前的task写入current_task(一个per-cpu变量)

x86_write_percpu(current_task, next_p);

return prev_p;

}

这个函数涉及到很多硬件相关的东西。需要参阅x86的架构体系方面的资料才能完全理解。

进程切换完了之后，还得要进行一些清理工作，这是由finish_task_switch()完成的。它的代码如下：

static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)

__releases(rq->lock)

{

struct mm_struct *mm = rq->prev_mm;

long prev_state;

//清空运行队列的prev_mm .

rq->prev_mm = NULL;

* A task struct has one reference for the use as "current".

* If a task dies, then it sets TASK_DEAD in tsk->state and calls

* schedule one last time. The schedule call will never return, and

* the scheduled task must drop that reference.

* The test for TASK_DEAD must occur while the runqueue locks are

* still held, otherwise prev could be scheduled on another cpu, die

* there before we look at prev->state, and then the reference would

* be dropped twice.

* Manfred Spraul

prev_state = prev->state;

finish_arch_switch(prev);

finish_lock_switch(rq, prev);

fire_sched_in_preempt_notifiers(current);

//mmdrop:减少mm的引用计数,如果引用计数为零,释放之

if (mm)

mmdrop(mm);

//如果过时程已经终止

if (unlikely(prev_state == TASK_DEAD)) {

* Remove function-return probe instances associated with this

* task and put them back on the free list.

//释放该进程的task

kprobe_flush_task(prev);

put_task_struct(prev);

}

这个函数比较简单。不需要做详细分析了。

到这里，进程切换已经分析完了。那进程什么时候切换呢？切换的依据又是什么呢？这就是我们接下来要分析的进程调度方面的内容了。

二：进程调度

进程调度在近几个版本中都进行了重要的修改。我们以2.6.9版为例过行分析。在进行具体的代码分析之前。我们先学习一下关于进程调度的原理。

1：进程类型：

在linux调度算法中，将进程分为两种类型。即：I/O消耗型和CPU消耗型。例如文本处理程序与正在执行的Make的程序。文本处理程序大部份时间都在等待I/O设备的输入,而make程序大部份时间都在CPU的处理上。因此为了提高响应速度，I/O消耗程序应该有较高的优先级，才能提高它的交互性。相反的，Make程序相比之下就不那么重要了。只要它能处理完就行了。因此，基于这样的原理，linux有一套交互程序的判断机制。

在task_struct结构中新增了一个成员：sleep_avg.此值初始值为100。进程在CPU上执行时，此值减少。当进程在等待时，此值增加。最后，在调度的时候。根据sleep_avg的值重新计算优先级。

2：进程优先级

正如我们在上面所说的：交互性强的需要高优先级，交互性弱的需要低优先级。在linux系统中，有两种优先级：普通优先级和实时优先级。我们在这里主要分析的是普通优先级，实时优先级部份可自行了解。

3：运行时间片

进程的时间片是指进程在抢占前可以持续运行的时间。在linux中,时间片长短可根据优先级来调度。进程不一定要一次运行完所有的时间片。可以在运时的中途被切换出去。

4：进程抢占

当一个进程被设为TASK_RUNING状态时。它会判断它的优先级是否高于正在运行的进程。如果是，则设置调度标志位，调用schedule()执行进程的调度。当一个进程的时间片为0时，也会执行进程抢占.

重要的数据结构：

在2。6。9中，每个CPU都有一个运行队列，这样就避免了竞争所带来的额外开销。运行队列的结构如下所示：

struct runqueue {

spinlock_t lock;

* nr_running and cpu_load should be in the same cacheline because

* remote CPUs use both these fields when doing load calculation.

//运行队列中进程数目

unsigned long nr_running;

#ifdef CONFIG_SMP

unsigned long cpu_load;

#endif

//进程切换的总数

unsigned long long nr_switches;

//当新一轮的时间片递减开始后，

//expired_timestamp变量记录着最早发生的进程耗完时间片事件的时间

//nr_uninterruptible: 记录本 CPU 尚处于 TASK_UNINTERRUPTIBLE 状态的进程数，和负载信息有关

unsigned long expired_timestamp, nr_uninterruptible;

//本就绪队列最近一次发生调度事件的时间

unsigned long long timestamp_last_tick;

//当前CPU上运行的进程和指定的空闲进程

task_t *curr, *idle;

//上一次调度时的MM结构

struct mm_struct *prev_mm;

//两个队列:active与expired

//active记录的是活动的队列

//expired:记录的是时间片运行完了的队列

prio_array_t *active, *expired, arrays[2];

//记录 expired 就绪进程组中的最高优先级（数值最小)

// 该变量在进程进入 expired 队列的时候保存

int best_expired_prio;

//记录本 CPU 因等待 IO 而处于休眠状态的进程数

atomic_t nr_iowait;

……

}

prio_array_t的定义如下：

struct prio_array {

//该队列的进程数

unsigned int nr_active;

//位图.每一位表示一个队列,如果该队列有可以运行的进程,置该位为1

unsigned long bitmap[BITMAP_SIZE];

//运行队列

struct list_head queue[MAX_PRIO];

};

进程调度的实现

对调度有了一般的了解之后，我们来看下进程调度到底是怎么样实现的：

进程调度由schedule()完成,它的代码如下：

asmlinkage void __sched schedule(void)

{

long *switch_count;

task_t *prev, *next;

runqueue_t *rq;

prio_array_t *array;

struct list_head *queue;

unsigned long long now;

unsigned long run_time;

int cpu, idx;

//WARN_ON(system_state == SYSTEM_BOOTING);

* Test if we are atomic. Since do_exit() needs to call into

* schedule() atomically, we ignore that path for now.

* Otherwise, whine if we are scheduling when we should not be.

//in_atomic:判断是否在一个中断上下文或者是原子上下文

if (likely(!(current->state & (TASK_DEAD | TASK_ZOMBIE)))) {

if (unlikely(in_atomic())) {

printk(KERN_ERR "bad: scheduling while atomic!\n");

dump_stack();

}

need_resched:

//禁止抢占

preempt_disable();

prev = current;

//得到当前CPU上的运行队列

rq = this_rq();

* The idle thread is not allowed to schedule!

* Remove this check after it has been exercised a bit.

//如果当前进程是空闲进程而且不处于运行状态.

//BUG:

if (unlikely(current == rq->idle) && current->state != TASK_RUNNING) {

printk(KERN_ERR "bad: scheduling from the idle thread!\n");

dump_stack();

}

release_kernel_lock(prev);

schedstat_inc(rq, sched_cnt);

now = sched_clock();

//timestamp: 1:切换上去的时间戳

// 2:切换下来的时间戳

// 3:被唤醒的时间恰戳

//prev是当前正在运行的进程.这个timestamp应该是表示被切换上去的时间戳

//当前运行的时间.如果超过了NS_MAX_SLEEP_AVG. 取运行时间为NS_MAX_SLEEP_AVG

if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))

run_time = now - prev->timestamp;

else

run_time = NS_MAX_SLEEP_AVG;

* Tasks with interactive credits get charged less run_time

* at high sleep_avg to delay them losing their interactive

* status

//HIGH_CREDIT()判断是否是一个交互式进程

//interactive_credit 超过了阀值

//如果是交互式进程，则它参与优先级计算的运行时间会比实际运行时间小

//以此获得较高的优先级

if (HIGH_CREDIT(prev))

run_time /= (CURRENT_BONUS(prev) ? : 1);

spin_lock_irq(&rq->lock);

* if entering off of a kernel preemption go straight

* to picking the next task.

switch_count = &prev->nivcsw;

//state: -1:不可运行 0:可运行的 >0 暂停的

if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {

switch_count = &prev->nvcsw;

//可中断且有末处理的信号.激活之.

if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&

unlikely(signal_pending(prev))))

prev->state = TASK_RUNNING;

else

//从运行队列中移除

deactivate_task(prev, rq);

}

//当前CPU

cpu = smp_processor_id();

//运行队列中没有可供运行的进程了

if (unlikely(!rq->nr_running)) {

go_idle:

//如果没有可运行的进程了.从其它CPU上"搬"进程过来

idle_balance(cpu, rq);

//如果还是没有可运行进程.则将Next赋值为空闲进程.

//再跳转到switch_tasks.将空闲进程切换进去

if (!rq->nr_running) {

next = rq->idle;

rq->expired_timestamp = 0;

wake_sleeping_dependent(cpu, rq);

* wake_sleeping_dependent() might have released

* the runqueue, so break out if we got new

* tasks meanwhile:

if (!rq->nr_running)

goto switch_tasks;

}

} else {

if (dependent_sleeper(cpu, rq)) {

schedstat_inc(rq, sched_goidle);

next = rq->idle;

goto switch_tasks;

}

* dependent_sleeper() releases and reacquires the runqueue

* lock, hence go into the idle loop if the rq went

* empty meanwhile:

if (unlikely(!rq->nr_running))

goto go_idle;

}

//活动队列

array = rq->active;

//如果活动队列中进程数为0.

if (unlikely(!array->nr_active)) {

* Switch the active and expired arrays.

//对换expired 与active

schedstat_inc(rq, sched_switch);

rq->active = rq->expired;

rq->expired = array;

array = rq->active;

rq->expired_timestamp = 0;

rq->best_expired_prio = MAX_PRIO;

} else

schedstat_inc(rq, sched_noswitch);

//在活动队列组中取第一个不为空的队列

idx = sched_find_first_bit(array->bitmap);

queue = array->queue + idx;

next = list_entry(queue->next, task_t, run_list);

//next进程即是将要被切换进的进程

//rt_task():判断是否是一个实时进程

if (!rt_task(next) && next->activated > 0) {

unsigned long long delta = now - next->timestamp;

//task->activated:表示进程因什么原因进入就绪态，这一原因会影响到调度优先级的计算

//-1，进程从 TASK_UNINTERRUPTIBLE 状态被唤醒

//0，缺省值，进程原本就处于就绪态

// 1，进程从 TASK_INTERRUPTIBLE 状态被唤醒，且不在中断上下文中

// 2，进程从 TASK_INTERRUPTIBLE 状态被唤醒，且在中断上下文中

//如果是中断服务程序调用的 activate_task()，也就是说进程由中断激活，则该进程最有可能是交互式的，因此，置 activated=2；否则置activated=1。

//如果进程是从 TASK_UNINTERRUPTIBLE 状态中被唤醒的，则 activated=-1（在try_to_wake_up()函数中）。

if (next->activated == 1)

delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;

array = next->array;

//将进程移出队列

dequeue_task(next, array);

//重新计算优先级

recalc_task_prio(next, next->timestamp + delta);

//重新插入运行队列

enqueue_task(next, array);

}

next->activated = 0;

switch_tasks:

prefetch(next);

//清除调度标志

clear_tsk_need_resched(prev);

rcu_qsctr_inc(task_cpu(prev));

//更新prev->sleep_avg

prev->sleep_avg -= run_time;

//如果sleep_avg小于零

if ((long)prev->sleep_avg <= 0) {

prev->sleep_avg = 0;

if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))

prev->interactive_credit--;

}

prev->timestamp = prev->last_ran = now;

sched_info_switch(prev, next);

//如果prev与next不相等,则执行切换过程

if (likely(prev != next)) {

next->timestamp = now;

rq->nr_switches++;

rq->curr = next;

++*switch_count;

//执行进程的切换过程

prepare_arch_switch(rq, next);

prev = context_switch(rq, prev, next);

barrier();

finish_task_switch(prev);

} else

spin_unlock_irq(&rq->lock);

reacquire_kernel_lock(current);

preempt_enable_no_resched();

//如果还有调度标志.那就重新执行调度过程.

//这个调度标志可能是在别处被设置的

if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))

goto need_resched;

}

在上面的代码中涉及到了SMP的负载平衡，动态优先级的知识。

所谓SMP负载平衡是指，当一个CPU的运行进程过多，但是另一个CPU的进程过少。需要在两者之间平衡运行的进程数。显而易见，这样可以提高系统的运行效率。

动态优先级中涉及到很多的公式，在这里不加详细分析，请自行了解。

我们接着看，如果一个进程的时间片完了，会进行怎么的处理。这个处理过程是在时间中断中完成的。详细请参阅本站的相关分析。

在时间中断处理程序中，会调用sheduler_tick()对当前进程运行的时间片做特定的处理。它的代码如下：

void scheduler_tick(int user_ticks, int sys_ticks)

{

//当前CPU

int cpu = smp_processor_id();

struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;

//取得当前CPU的运行队列

runqueue_t *rq = this_rq();

task_t *p = current;

//更新timestamp_last_tick为当前时间

rq->timestamp_last_tick = sched_clock();

if (rcu_pending(cpu))

rcu_check_callbacks(cpu, user_ticks);

/* note: this timer irq context must be accounted for as well */

if (hardirq_count() - HARDIRQ_OFFSET) {

cpustat->irq += sys_ticks;

sys_ticks = 0;

} else if (softirq_count()) {

cpustat->softirq += sys_ticks;

sys_ticks = 0;

}

//如果当前进程为空闲进程

if (p == rq->idle) {

if (atomic_read(&rq->nr_iowait) > 0)

cpustat->iowait += sys_ticks;

else

cpustat->idle += sys_ticks;

if (wake_priority_sleeper(rq))

goto out;

rebalance_tick(cpu, rq, IDLE);

return;

}

if (TASK_NICE(p) > 0)

cpustat->nice += user_ticks;

else

cpustat->user += user_ticks;

cpustat->system += sys_ticks;

/* Task might have expired already, but not scheduled off yet */

//如果当前进程不是在活动队列中

//说明P是过期但末移除的进程

if (p->array != rq->active) {

//设置调度标志位

set_tsk_need_resched(p);

goto out;

}

spin_lock(&rq->lock);

* The task was running during this tick - update the

* time slice counter. Note: we do not update a thread's

* priority until it either goes to sleep or uses up its

* timeslice. This makes it possible for interactive tasks

* to use up their timeslices at their highest priority levels.

//实时进程的处理

if (rt_task(p)) {

* RR tasks need a special form of timeslice management.

* FIFO tasks have no timeslices.

if ((p->policy == SCHED_RR) && !--p->time_slice) {

p->time_slice = task_timeslice(p);

p->first_time_slice = 0;

set_tsk_need_resched(p);

/* put it at the end of the queue: */

dequeue_task(p, rq->active);

enqueue_task(p, rq->active);

}

goto out_unlock;

}

//如果进程的时间片用完

if (!--p->time_slice) {

//移出活动队列

dequeue_task(p, rq->active);

//设置调度标志

set_tsk_need_resched(p);

//重新计算优先级与时间片

p->prio = effective_prio(p);

p->time_slice = task_timeslice(p);

p->first_time_slice = 0;

//如果rq->expired_timestamp 为空,则设为当前时间

//expired_timestamp:表示运行队列中最先时间片耗完的时间

if (!rq->expired_timestamp)

rq->expired_timestamp = jiffies;

//在这里要注意了:

//如果是一个交互性很强的进程.如果它时间版运行完了.将它移至expired队列的

//话,那就必须要等active队列的进程运行完之后才能被调度.这样会影响用户的交互性

//判断是否是一个交互性很强的进程.如果不是,将其加至expired队列

//如果是,将其加至active队列

if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {

//加入expired队列

enqueue_task(p, rq->expired);

if (p->static_prio < rq->best_expired_prio)

rq->best_expired_prio = p->static_prio;

} else

//移至active队列

enqueue_task(p, rq->active);

} else {

* Prevent a too long timeslice allowing a task to monopolize

* the CPU. We do this by splitting up the timeslice into

* smaller pieces.

* Note: this does not mean the task's timeslices expire or

* get lost in any way, they just might be preempted by

* another task of equal priority. (one with higher

* priority would have preempted this task already.) We

* requeue this task to the end of the list on this priority

* level, which is in essence a round-robin of tasks with

* equal priority.

* This only applies to tasks in the interactive

* delta range with at least TIMESLICE_GRANULARITY to requeue.

//当前进程的时间片没有运行完...

//如果进程的剩余时间超长.

if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -

p->time_slice) % TIMESLICE_GRANULARITY(p)) &&

(p->time_slice >= TIMESLICE_GRANULARITY(p)) &&

(p->array == rq->active)) {

//从活动队列中移除

dequeue_task(p, rq->active);

//设置调度标志

set_tsk_need_resched(p);

//重新计算优先级

p->prio = effective_prio(p);

//重新插入活动队列

enqueue_task(p, rq->active);

}

out_unlock:

spin_unlock(&rq->lock);

out:

rebalance_tick(cpu, rq, NOT_IDLE);

}

总结：

操作系统用的进程调度算法是衡量一个操作系统的重要标志。在2.4之前，调度部份算法变动很少。但是在2.6版中的近几个版本中发生了很大的变化。其中重要的变化是支持内核抢占。内核抢占导致了内核的复杂性。很多代码都必须要保持重入。

阅读(2032) | 评论(0) | 转发(0) |

上一篇：linux进程管理之wait系统调用

下一篇：linux进程管理之信号处理

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6