The kernel, among its other time-related duties, must periodically collect various data used for:
*Checking the CPU resource limits of the running processes
*Updating statistics about the local CPU workload
*Computing the average system load
*Profiling the kernel code
6.4.1. Updating Local CPU Statistics
We have mentioned that the update_process_times( ) function is invoked either by the global timer interrupt handler on uniprocessor systems or by the local timer interrupt handler on multiprocessor systems to update some kernel statistics. This function performs the following steps:
1. Checks how long the current process has been running. Depending on whether the current process was running in User Mode or in Kernel Mode when the timer interrupt occurred, invokes either account_user_time( ) or account_system_time( ). Each of these functions performs essentially the following steps:
   a. Updates either the utime field (ticks spent in User Mode) or the stime field (ticks spent in Kernel Mode) of the current process descriptor. Two additional fields called cutime and cstime are provided in the process descriptor to count the number of CPU ticks spent by the process children in User Mode and Kernel Mode, respectively. For reasons of efficiency, these fields are not updated by update_process_times( ), but rather when the parent process queries the state of one of its children (see the section "Destroying Processes" in Chapter 3).
   b. Checks whether the total CPU time limit has been reached; if so, sends SIGXCPU and SIGKILL signals to current. The section "Process Resource Limits" in Chapter 3 describes how the limit is controlled by the signal->rlim[RLIMIT_CPU].rlim_cur field of each process descriptor.
   c. Invokes account_it_virt( ) and account_it_prof( ) to check the process timers (see the section "The setitimer( ) and alarm( ) System Calls" later in this chapter).
   d. Updates some kernel statistics stored in the kstat per-CPU variable.
2. Invokes raise_softirq( ) to activate the TIMER_SOFTIRQ softirq on the local CPU (see the section "Software Timers and Delay Functions" later in this chapter).
3. If some old version of an RCU-protected data structure has to be reclaimed, checks whether the local CPU has gone through a quiescent state and invokes tasklet_schedule( ) to activate the rcu_tasklet tasklet of the local CPU (see the section "Read-Copy Update (RCU)" in Chapter 5).
4. Invokes the scheduler_tick( ) function, which decreases the time slice counter of the current process and checks whether its quantum is exhausted. We'll discuss this function in depth in the section "The scheduler_tick( ) Function" in Chapter 7.
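The per-tick accounting in the first step can be illustrated with a small user-space sketch. This is not the kernel's code: struct task_sketch, account_tick( ), the cpu_limit field (standing in for rlim_cur, expressed in ticks), and the limit_hit flag (standing in for the signals) are all hypothetical names chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the process descriptor. */
struct task_sketch {
    unsigned long utime;      /* ticks spent in User Mode */
    unsigned long stime;      /* ticks spent in Kernel Mode */
    unsigned long cpu_limit;  /* assumed CPU limit in ticks (stands in for rlim_cur) */
    bool limit_hit;           /* stands in for sending a signal to the process */
};

/* Charge one timer tick to the task, then check the CPU time limit. */
void account_tick(struct task_sketch *t, bool user_mode)
{
    /* The mode the CPU was in when the interrupt occurred decides the field. */
    if (user_mode)
        t->utime++;
    else
        t->stime++;

    /* The limit applies to the total CPU time, user plus system. */
    if (t->utime + t->stime >= t->cpu_limit)
        t->limit_hit = true;
}
```

Calling account_tick( ) once per simulated tick with a limit of three ticks sets limit_hit on the third call, regardless of how the ticks are split between the two modes.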
6.4.2. Keeping Track of System Load
Every Unix kernel keeps track of how much CPU activity is being carried on by the system. These statistics are used by various administration utilities such as top. A user who enters the uptime command sees the statistics as the "load average" relative to the last minute, the last 5 minutes, and the last 15 minutes. On a uniprocessor system, a value of 0 means that there are no active processes (besides the swapper process 0) to run, while a value of 1 means that the CPU is 100 percent busy with a single process, and values greater than 1 mean that the CPU is shared among several active processes.[*]
[*] Linux includes in the load average all processes that are in the TASK_RUNNING and TASK_UNINTERRUPTIBLE states. However, under normal conditions, there are few TASK_UNINTERRUPTIBLE processes, so a high load usually means that the CPU is busy.
At every tick, update_times( ) invokes the calc_load( ) function, which counts the number of processes in the TASK_RUNNING or TASK_UNINTERRUPTIBLE state and uses this number to update the average system load.
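One common way to maintain such an average without floating point is an exponentially decaying average in fixed-point arithmetic. The sketch below shows that technique. The text above does not give the formula, so the constants (FSHIFT, FIXED_1, EXP_1) are assumptions recalled from the 2.6 headers, where EXP_1 presumes one sample every 5 seconds; decay_load( ) is a hypothetical helper name.

```c
#define FSHIFT   11                /* assumed: bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)   /* 1.0 in fixed point */
#define EXP_1    1884UL            /* assumed: decay factor for the 1-minute average */

/*
 * One update step of an exponentially decaying average.
 * load and active are fixed-point values (an integer count times FIXED_1);
 * exp_factor is the fixed-point decay factor, smaller than FIXED_1.
 */
unsigned long decay_load(unsigned long load, unsigned long exp_factor,
                         unsigned long active)
{
    load *= exp_factor;                       /* old average, decayed */
    load += active * (FIXED_1 - exp_factor);  /* contribution of the new sample */
    return load >> FSHIFT;                    /* rescale back to fixed point */
}
```

With one active process, the average starts at 0 and creeps toward FIXED_1 (a load of 1.0) over successive samples; once it reaches FIXED_1 it stays there as long as the number of active processes does not change.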
6.4.3. Profiling the Kernel Code
Linux includes a minimalist code profiler called readprofile, used by Linux developers to discover where the kernel spends its time in Kernel Mode. The profiler identifies the hot spots of the kernel, that is, the most frequently executed fragments of kernel code. Identifying the kernel hot spots is very important, because they may point out kernel functions that should be further optimized.
The profiler is based on a simple Monte Carlo algorithm: at every timer interrupt, the kernel determines whether the interrupt occurred in Kernel Mode; if so, the kernel fetches from the stack the value that the eip register had before the interruption and uses it to discover what the kernel was doing before the interrupt. In the long run, the samples accumulate on the hot spots.
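The sampling idea can be sketched in a few lines of user-space C. This is a conceptual illustration only: struct sym, find_symbol( ), and record_sample( ) are hypothetical names, and charging each sample directly to a function through a sorted symbol table is a simplification made for the example, not a description of how the kernel stores its counters.

```c
#include <stddef.h>

/* Hypothetical symbol table entry: a function and its start address. */
struct sym {
    unsigned long start;
    const char *name;
};

/*
 * Return the index of the function containing addr, or -1 if addr lies
 * below the first symbol. The table must be sorted by start address.
 */
int find_symbol(const struct sym *table, size_t n, unsigned long addr)
{
    int found = -1;
    for (size_t i = 0; i < n && table[i].start <= addr; i++)
        found = (int)i;
    return found;
}

/* Record one timer-interrupt sample; User Mode samples are ignored. */
void record_sample(const struct sym *table, size_t n, unsigned long *hits,
                   int kernel_mode, unsigned long eip)
{
    if (!kernel_mode)
        return;
    int i = find_symbol(table, n, eip);
    if (i >= 0)
        hits[i]++;   /* over many samples, hot functions collect the most hits */
}
```

Feeding record_sample( ) a long stream of sampled addresses yields a histogram in hits[] whose largest entries mark the hot spots.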
The profile_tick( ) function collects the data for the code profiler. It is invoked either by the do_timer_interrupt( ) function in uniprocessor systems (by the global timer interrupt handler) or by the smp_local_timer_interrupt( ) function in multiprocessor systems (by the local timer interrupt handler).
To enable the code profiler, the Linux kernel must be booted by passing as a parameter the string profile=N, where 2^N denotes the size of the code fragments to be profiled. The collected data can be read from the /proc/profile file. The counters are reset by writing to the same file; in multiprocessor systems, writing into the file can also change the sample frequency (see the earlier section "Timekeeping Architecture in Multiprocessor Systems"). However, kernel developers do not usually access /proc/profile directly; instead, they use the readprofile system command.
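To make the role of N concrete, the following sketch maps a sampled address to the index of its counter when each counter covers a code fragment of 2^N bytes. The function name profile_bucket( ) and the text_start parameter (the address where the profiled code begins) are assumptions introduced for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Map a sampled instruction address to the index of its counter.
 * Each counter covers 2^shift bytes of code starting at text_start.
 */
size_t profile_bucket(unsigned long eip, unsigned long text_start,
                      unsigned int shift)
{
    assert(eip >= text_start);  /* samples below the profiled region are invalid */
    return (size_t)((eip - text_start) >> shift);
}
```

A small N gives fine-grained counters at the cost of a larger counter array; a large N groups many instructions under one counter and so uses less memory.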
The Linux 2.6 kernel includes yet another profiler called oprofile. Besides being more flexible and customizable than readprofile, oprofile can be used to discover hot spots in kernel code, User Mode applications, and system libraries. When oprofile is being used, profile_tick( ) invokes the timer_notify( ) function to collect the data used by this new profiler.
6.4.4. Checking the NMI Watchdogs
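In multiprocessor systems, Linux offers yet another feature to kernel developers: a watchdog system, which can be quite useful for detecting kernel bugs that cause a system freeze. To activate the watchdog, the kernel must be booted with the nmi_watchdog parameter.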
The watchdog is based on a clever hardware feature of local and I/O APICs: they can generate periodic NMI interrupts on every CPU. Because NMI interrupts are not masked by the cli assembly language instruction, the watchdog can detect deadlocks even when interrupts are disabled.
As a consequence, once every tick, all CPUs, regardless of what they are doing, start executing the NMI interrupt handler; in turn, the handler invokes do_nmi( ). This function gets the logical number n of the CPU, and then checks the apic_timer_irqs field of the nth entry of irq_stat (see Table 4-8 in Chapter 4). If the CPU is working properly, the value must be different from the value read at the previous NMI interrupt. When the CPU is running properly, the apic_timer_irqs field of the nth entry is increased by the local timer interrupt handler (see the earlier section "The local timer interrupt handler"); if the counter is not increased, the local timer interrupt handler has not been executed in a whole tick. Not a good thing, you know.
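The progress check described above can be sketched as follows. The names struct watchdog_sketch and watchdog_check( ) are hypothetical, and the threshold parameter is an assumption added so that the caller can decide how many consecutive NMIs without progress count as a freeze; a threshold of 1 corresponds to the check exactly as described in the text.

```c
/* Hypothetical per-CPU watchdog state. */
struct watchdog_sketch {
    unsigned int last_count;  /* counter value seen at the previous NMI */
    unsigned int stalled;     /* consecutive NMIs with no progress */
};

/*
 * Called at each NMI with the current value of the CPU's local timer
 * interrupt counter. Returns 1 if the CPU looks frozen, 0 otherwise.
 */
int watchdog_check(struct watchdog_sketch *w, unsigned int timer_irqs,
                   unsigned int threshold)
{
    if (timer_irqs != w->last_count) {
        /* The local timer handler ran since the last NMI: all is well. */
        w->last_count = timer_irqs;
        w->stalled = 0;
        return 0;
    }
    /* No local timer interrupt was handled since the previous NMI. */
    w->stalled++;
    return w->stalled >= threshold;
}
```

Keeping only the last observed counter value is enough, because the watchdog cares about whether the counter moved, not about how far it moved.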
When the NMI interrupt handler detects a CPU freeze, it rings all the bells: it logs scary messages in the system logfiles, dumps the contents of the CPU registers and of the kernel stack (kernel oops), and finally kills the current process. This gives kernel developers a chance to discover what's gone wrong.