Category: LINUX

2010-02-06 19:03:24

Chapter 3. Process Management


The process is one of the fundamental abstractions in Unix operating systems[1]. A process is a program (object code stored on some media) in execution. Processes are, however, more than just the executing program code (often called the text section in Unix). They also include a set of resources such as open files and pending signals, internal kernel data, processor state, an address space, one or more threads of execution, and a data section containing global variables. Processes, in effect, are the living result of running program code.


[1] The other fundamental abstraction is files.


Threads of execution, often shortened to threads, are the objects of activity within the process. Each thread includes a unique program counter, process stack, and set of processor registers. The kernel schedules individual threads, not processes. In traditional Unix systems, each process consists of one thread. In modern systems, however, multithreaded programs (those that consist of more than one thread) are common. As you will see later, Linux has a unique implementation of threads: It does not differentiate between threads and processes. To Linux, a thread is just a special kind of process.


On modern operating systems, processes provide two virtualizations: a virtualized processor and virtual memory. The virtual processor gives the process the illusion that it alone monopolizes the system, despite possibly sharing the processor among dozens of other processes. Chapter 4, "Process Scheduling," discusses this virtualization. Virtual memory lets the process allocate and manage memory as if it alone owned all the memory in the system. Virtual memory is covered in Chapter 11, "Memory Management." Interestingly, note that threads share the virtual memory abstraction while each receives its own virtualized processor.


A program itself is not a process; a process is an active program and related resources. Indeed, two or more processes can exist that are executing the same program. In fact, two or more processes can exist that share various resources, such as open files or an address space.


A process begins its life when, not surprisingly, it is created. In Linux, this occurs by means of the fork() system call, which creates a new process by duplicating an existing one. The process that calls fork() is the parent, whereas the new process is the child. The parent resumes execution and the child starts execution at the same place, where the call returns. The fork() system call returns from the kernel twice: once in the parent process and again in the newborn child.


Often, immediately after a fork it is desirable to execute a new, different, program. The exec*() family of function calls is used to create a new address space and load a new program into it. In modern Linux kernels, fork() is actually implemented via the clone() system call, which is discussed in a following section.


Finally, a program exits via the exit() system call. This function terminates the process and frees all its resources. A parent process can inquire about the status of a terminated child via the wait4()[2] system call, which enables a process to wait for the termination of a specific process. When a process exits, it is placed into a special zombie state that is used to represent terminated processes until the parent calls wait() or waitpid().


[2] The kernel implements the wait4() system call. Linux systems, via the C library, typically provide the wait(), waitpid(), wait3(), and wait4() functions. All these functions return status about a terminated process, albeit with slightly different semantics.

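To make this lifecycle concrete, here is a minimal user-space sketch (not from the original text) showing both returns of fork(), the child terminating via exit(), and the parent reaping the zombie with waitpid(), which the C library builds on the kernel's wait4():

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        pid_t pid = fork();                     /* the call returns twice */

        if (pid == 0) {
                /* zero return value: this is the newborn child */
                exit(7);                        /* child terminates and becomes a zombie */
        } else if (pid > 0) {
                int status;
                /* nonzero return value: this is the parent; pid is the child's PID */
                if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
                        printf("child %d exited with code %d\n", pid, WEXITSTATUS(status));
        } else {
                perror("fork");
        }
        return 0;
}

Between the child's exit() and the parent's waitpid(), the child shows up in ps as a zombie (state Z).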

Another name for a process is a task. The Linux kernel internally refers to processes as tasks. In this book, I will use the terms interchangeably, although when I say task I am generally referring to a process from the kernel's point of view.


 

 

Process Descriptor and the Task Structure


The kernel stores the list of processes in a circular doubly linked list called the task list[3]. Each element in the task list is a process descriptor of the type struct task_struct, which is defined in <linux/sched.h>. The process descriptor contains all the information about a specific process.


[3] Some texts on operating system design call this list the task array. Because the Linux implementation is a linked list and not a static array, it is called the task list.


The task_struct is a relatively large data structure, at around 1.7 kilobytes on a 32-bit machine. This size, however, is quite small considering that the structure contains all the information that the kernel has and needs about a process. The process descriptor contains the data that describes the executing program: open files, the process's address space, pending signals, the process's state, and much more (see Figure 3.1).

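For orientation, the following is a heavily abridged sketch of struct task_struct. The field names follow the 2.6 sources, but this is not the complete definition; the real structure has hundreds of members and its exact layout changes between kernel versions.

struct task_struct {
        volatile long         state;        /* -1 unrunnable, 0 runnable, >0 stopped */
        struct thread_info   *thread_info;  /* low-level information for this task */
        unsigned long         flags;        /* per-process flags (PF_*) */
        struct mm_struct     *mm;           /* address space */
        pid_t                 pid;          /* process identifier */
        struct task_struct   *parent;       /* this task's parent */
        struct list_head      children;     /* list of this task's children */
        struct list_head      sibling;      /* linkage in the parent's children list */
        struct list_head      tasks;        /* linkage in the global task list */
        struct files_struct  *files;        /* open file information */
        struct signal_struct *signal;       /* signal handling information */
        char                  comm[16];     /* executable name, excluding the path */
        /* ... hundreds of additional fields omitted ... */
};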

Figure 3.1. The process descriptor and task list.

 

Allocating the Process Descriptor


The task_struct structure is allocated via the slab allocator to provide object reuse and cache coloring (see Chapter 11, "Memory Management"). Prior to the 2.6 kernel series, struct task_struct was stored at the end of the kernel stack of each process. This allowed architectures with few registers, such as x86, to calculate the location of the process descriptor via the stack pointer without using an extra register to store the location. With the process descriptor now dynamically created via the slab allocator, a new structure, struct thread_info, was created that again lives at the bottom of the stack (for stacks that grow down) and at the top of the stack (for stacks that grow up)[4]. See Figure 3.2. The new structure also makes it rather easy to calculate offsets of its values for use in assembly code.


[4] Register-impaired architectures were not the only reason for creating struct thread_info.


 

Figure 3.2. The process descriptor and kernel stack.

 

The thread_info structure is defined on x86 in <asm/thread_info.h> as

struct thread_info {
        struct task_struct    *task;          /* pointer to this task's task_struct */
        struct exec_domain    *exec_domain;   /* execution domain */
        unsigned long         flags;          /* low-level flags */
        unsigned long         status;         /* thread synchronous flags */
        __u32                 cpu;            /* current CPU */
        __s32                 preempt_count;  /* 0 => preemptable, <0 => BUG */
        mm_segment_t          addr_limit;     /* thread address space limit */
        struct restart_block  restart_block;  /* system call restart data */
        unsigned long         previous_esp;   /* ESP of the previous stack, if nested */
        __u8                  supervisor_stack[0];  /* start of the kernel stack */
};

 

Each task's thread_info structure is allocated at the end of its stack. The task element of the structure is a pointer to the task's actual task_struct.

X86结构体thread_info上,寄存器是被定义在文件里。每个任务的结构thread_info被分配在栈的底部。任务结构体元素是一个指向实际task_struct的指针。

Storing the Process Descriptor


The system identifies processes by a unique process identification value or PID. The PID is a numerical value that is represented by the opaque type[5] pid_t, which is typically an int. Because of backward compatibility with earlier Unix and Linux versions, however, the default maximum value is only 32,768 (that of a short int), although the value can optionally be increased to the full range afforded the type. The kernel stores this value as pid inside each process descriptor.


[5] An opaque type is a data type whose physical representation is unknown or irrelevant.


This maximum value is important because it is essentially the maximum number of processes that may exist concurrently on the system. Although 32,768 might be sufficient for a desktop system, large servers may require many more processes. The lower the value, the sooner the values will wrap around, destroying the useful notion that higher values indicate later run processes than lower values. If the system is willing to break compatibility with old applications, the administrator may increase the maximum value via /proc/sys/kernel/pid_max.

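As an illustration (not from the original text), the current limit can be read from that proc file in user space; writing a larger number to the same file as root raises it:

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
        int max;

        if (f && fscanf(f, "%d", &max) == 1)
                printf("pid_max = %d\n", max);  /* 32768 by default */
        if (f)
                fclose(f);
        return 0;
}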

Inside the kernel, tasks are typically referenced directly by a pointer to their task_struct structure. In fact, most kernel code that deals with processes works directly with struct task_struct. Consequently, it is very useful to be able to quickly look up the process descriptor of the currently executing task, which is done via the current macro. This macro must be separately implemented by each architecture. Some architectures save a pointer to the task_struct structure of the currently running process in a register, allowing for efficient access. Other architectures, such as x86 (which has few registers to waste), make use of the fact that struct thread_info is stored on the kernel stack to calculate the location of thread_info and subsequently the task_struct.


On x86, current is calculated by masking out the 13 least significant bits of the stack pointer to obtain the thread_info structure. This is done by the current_thread_info() function. The assembly is shown here:

X86体系中,current被计算用来屏蔽至少栈指针的有效位,来获得thread_info结构体。这个是通过函数curent_thread_info()完成的。汇编语言如下:

movl $-8192, %eax    # load the mask ~(8192 - 1), i.e. 0xffffe000
andl %esp, %eax      # eax = stack pointer & mask = address of thread_info

 

This assumes that the stack size is 8KB. When 4KB stacks are enabled, 4096 is used in lieu of 8192.


Finally, current dereferences the task member of thread_info to return the task_struct:


current_thread_info()->task;
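Putting the masking and the dereference together, the x86 version might be sketched as an inline function like the following; the real implementation lives in the architecture headers and uses THREAD_SIZE rather than a hard-coded 8192:

/* sketch: mask the stack pointer down to the start of the 8KB kernel
 * stack, where thread_info lives, then follow its task pointer */
static inline struct thread_info *current_thread_info(void)
{
        struct thread_info *ti;
        __asm__("andl %%esp, %0" : "=r" (ti) : "0" (~(8192UL - 1)));
        return ti;
}

#define current (current_thread_info()->task)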

Contrast this approach with that taken by PowerPC (IBM's modern RISC-based microprocessor), which stores the current task_struct in a register. Thus, current on PPC merely returns the value stored in the register r2. PPC can take this approach because, unlike x86, it has plenty of registers. Because accessing the process descriptor is a common and important job, the PPC kernel developers deem using a register worthy for the task.


Process State


The state field of the process descriptor describes the current condition of the process (see Figure 3.3). Each process on the system is in exactly one of five different states. This value is represented by one of five flags:


·         TASK_RUNNING The process is runnable; it is either currently running or on a runqueue waiting to run (runqueues are discussed in Chapter 4, "Scheduling"). This is the only possible state for a process executing in user-space; it can also apply to a process in kernel-space that is actively running.


·         TASK_INTERRUPTIBLE The process is sleeping (that is, it is blocked), waiting for some condition to exist. When this condition exists, the kernel sets the process's state to TASK_RUNNING. The process also awakes prematurely and becomes runnable if it receives a signal.


·         TASK_UNINTERRUPTIBLE This state is identical to TASK_INTERRUPTIBLE except that it does not wake up and become runnable if it receives a signal. This is used in situations where the process must wait without interruption or when the event is expected to occur quite quickly. Because the task does not respond to signals in this state, TASK_UNINTERRUPTIBLE is less often used than TASK_INTERRUPTIBLE[6].


[6] This is why you have those dreaded unkillable processes with state D in ps(1). Because the task will not respond to signals, you cannot send it a SIGKILL signal. Further, even if you could terminate the task, it would not be wise as the task is supposedly in the middle of an important operation and may hold a semaphore.


·         TASK_ZOMBIE The task has terminated, but its parent has not yet issued a wait4() system call. The task's process descriptor must remain in case the parent wants to access it. If the parent calls wait4(), the process descriptor is deallocated.


·         TASK_STOPPED Process execution has stopped; the task is not running nor is it eligible to run. This occurs if the task receives the SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU signal or if it receives any signal while it is being debugged.


Figure 3.3. Flow chart of process states.

 

Manipulating the Current Process State


Kernel code often needs to change a process's state. The preferred mechanism is using


set_task_state(task, state);        /* set task 'task' to state 'state' */

This function sets the given task to the given state. If applicable, it also provides a memory barrier to force ordering on other processors (this is only needed on SMP systems). Otherwise, it is equivalent to


task->state = state;

 

The method set_current_state(state) is synonymous to set_task_state(current, state).
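As a usage sketch, kernel code typically pairs these calls with schedule() when waiting for an event. Here event_occurred is a placeholder condition; real code would also place itself on a wait queue so that another thread can wake it up:

/* mark ourselves sleeping before testing the condition, so that a
 * wake-up cannot be lost between the test and the call to schedule() */
set_current_state(TASK_INTERRUPTIBLE);
while (!event_occurred) {
        schedule();                        /* give up the processor */
        set_current_state(TASK_INTERRUPTIBLE);
}
__set_current_state(TASK_RUNNING);         /* the event arrived; run again */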

Process Context


One of the most important parts of a process is the executing program code. This code is read in from an executable file and executed within the program's address space. Normal program execution occurs in user-space. When a program executes a system call (see Chapter 5, "System Calls") or triggers an exception, it enters kernel-space. At this point, the kernel is said to be "executing on behalf of the process" and is in process context. When in process context, the current macro is valid[7]. Upon exiting the kernel, the process resumes execution in user-space, unless a higher-priority process has become runnable in the interim, in which case the scheduler is invoked to select the higher priority process.


[7] Other than process context there is interrupt context, which we discuss in Chapter 6, "Interrupts and Interrupt Handlers." In interrupt context, the system is not running on behalf of a process, but is executing an interrupt handler. There is no process tied to interrupt handlers and consequently no process context.


System calls and exception handlers are well-defined interfaces into the kernel. A process can begin executing in kernel-space only through one of these interfaces; all access to the kernel is through these interfaces.


The Process Family Tree


A distinct hierarchy exists between processes in Unix systems, and Linux is no exception. All processes are descendants of the init process, whose PID is one. The kernel starts init in the last step of the boot process. The init process, in turn, reads the system initscripts and executes more programs, eventually completing the boot process.


Every process on the system has exactly one parent. Likewise, every process has zero or more children. Processes that are all direct children of the same parent are called siblings. The relationship between processes is stored in the process descriptor. Each task_struct has a pointer to the parent's task_struct, named parent, and a list of children, named children. Consequently, given the current process, it is possible to obtain the process descriptor of its parent with the following code:


struct task_struct *my_parent = current->parent;

Similarly, it is possible to iterate over a process's children with


struct task_struct *task;
struct list_head *list;
 
list_for_each(list, &current->children) {
        task = list_entry(list, struct task_struct, sibling);
        /* task now points to one of current's children */
}

The init task's process descriptor is statically allocated as init_task. A good example of the relationship between all processes is the fact that this code will always succeed:


struct task_struct *task;
 
for (task = current; task != &init_task; task = task->parent)
        ;
/* task now points to init */

In fact, you can follow the process hierarchy from any one process in the system to any other. Oftentimes, however, it is desirable simply to iterate over all processes in the system. This is easy because the task list is a circular doubly linked list. To obtain the next task in the list, given any valid task, use:


list_entry(task->tasks.next, struct task_struct, tasks)

Obtaining the previous works the same way:

list_entry(task->tasks.prev, struct task_struct, tasks)

These two routines are provided by the macros next_task(task) and prev_task(task), respectively. Finally, the macro for_each_process(task) is provided, which iterates over the entire task list. On each iteration, task points to the next task in the list:


struct task_struct *task;
 
for_each_process(task) {
        /* this pointlessly prints the name and PID of each task */
        printk("%s[%d]\n", task->comm, task->pid);
}

 

Note: It can be expensive to iterate over every task in a system with many processes; code should have good reason (and no alternative) before doing so.


 

Process Creation


Process creation in Unix is unique. Most operating systems implement a spawn mechanism to create a new process in a new address space, read in an executable, and begin executing it. Unix takes the unusual approach of separating these steps into two distinct functions: fork() and exec()[8]. The first, fork(), creates a child process that is a copy of the current task. It differs from the parent only in its PID (which is unique), its PPID (parent's PID, which is set to the original process), and certain resources and statistics, such as pending signals, which are not inherited. The second function, exec(), loads a new executable into the address space and begins executing it. The combination of fork() followed by exec() is similar to the single function most operating systems provide.


[8] By exec() I mean any member of the exec() family of functions. The kernel implements the execve() system call on top of which execlp(), execle(), execv(), and execvp() are implemented.

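A minimal user-space sketch of the fork()-then-exec() pattern just described; /bin/ls is an arbitrary example program:

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        pid_t pid = fork();

        if (pid == 0) {
                /* child: replace the duplicated image with a new program */
                execl("/bin/ls", "ls", "-l", (char *) NULL);
                perror("execl");        /* reached only if the exec failed */
                _exit(1);
        } else if (pid > 0) {
                waitpid(pid, NULL, 0);  /* parent: wait for the new program to finish */
        }
        return 0;
}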

Copy-on-Write


Traditionally, upon fork() all resources owned by the parent are duplicated and the copy is given to the child. This approach is significantly naïve and inefficient in that it copies much data that might otherwise be shared. Worse still, if the new process were to immediately execute a new image, all that copying would go to waste. In Linux, fork() is implemented through the use of copy-on-write pages. Copy-on-write (or COW) is a technique to delay or altogether prevent copying of the data. Rather than duplicate the process address space, the parent and the child can share a single copy. The data, however, is marked in such a way that if it is written to, a duplicate is made and each process receives a unique copy. Consequently, the duplication of resources occurs only when they are written; until then, they are shared read-only. This technique delays the copying of each page in the address space until it is actually written to. In the case that the pages are never written, for example, if exec() is called immediately after fork(), they never need to be copied. The only overhead incurred by fork() is the duplication of the parent's page tables and the creation of a unique process descriptor for the child. In the common case that a process executes a new executable image immediately after forking, this optimization prevents the wasted copying of large amounts of data (with the address space, easily tens of megabytes). This is an important optimization because the Unix philosophy encourages quick process execution.

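Copy-on-write is invisible to user space except through its semantics: after fork(), the first write by either process causes the affected page to be duplicated, so the other process never sees the change. A small sketch (not from the original text):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        int value = 1;
        pid_t pid = fork();     /* parent and child now share pages, marked read-only */

        if (pid == 0) {
                value = 2;      /* first write: the page is copied for the child */
                printf("child sees  value = %d\n", value);   /* prints 2 */
        } else if (pid > 0) {
                waitpid(pid, NULL, 0);
                printf("parent sees value = %d\n", value);   /* still prints 1 */
        }
        return 0;
}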

fork()

Linux implements fork() via the clone() system call. This call takes a series of flags that specify which resources, if any, the parent and child process should share (see the section on "The Linux Implementation of Threads" later in this chapter for more about the flags). The fork(), vfork(), and __clone() library calls all invoke the clone() system call with the requisite flags. The clone() system call, in turn, calls do_fork().


The bulk of the work in forking is handled by do_fork(), which is defined in kernel/fork.c. This function calls copy_process(), and then starts the process running. The interesting work is done by copy_process():


·         It calls dup_task_struct(), which creates a new kernel stack, thread_info structure, and task_struct for the new process. The new values are identical to those of the current task. At this point, the child and parent process descriptors are identical.


·         It then checks that the new child will not exceed the resource limits on the number of processes for the current user.


·         Now the child needs to differentiate itself from its parent. Various members of the process descriptor are cleared or set to initial values. Members of the process descriptor that are not inherited are primarily statistically information. The bulk of the data in the process descriptor is shared.


·         Next, the child's state is set to TASK_UNINTERRUPTIBLE, to ensure that it does not yet run.


·         Now, copy_process() calls copy_flags() to update the flags member of the task_struct. The PF_SUPERPRIV flag, which denotes whether a task used super-user privileges, is cleared. The PF_FORKNOEXEC flag, which denotes a process that has not called exec(), is set.


·         Next, it calls get_pid() to assign an available PID to the new task.


·         Depending on the flags passed to clone(), copy_process() then either duplicates or shares open files, filesystem information, signal handlers, process address space, and namespace. These resources are typically shared between threads in a given process; otherwise they are unique and thus copied here.


·         Next, the remaining timeslice between the parent and its child is split between the two (this is discussed in Chapter 4).


·         Finally, copy_process() cleans up and returns to the caller a pointer to the new child.


Back in do_fork(), if copy_process() returns successfully, the new child is woken up and run. Deliberately, the kernel runs the child process first[9]. In the common case of the child simply calling exec() immediately, this eliminates any copy-on-write overhead that would occur if the parent ran first and began writing to the address space.


[9] Amusingly, this does not currently function correctly, although the goal is for the child to run first.


vfork()

The vfork() system call has the same effect as fork(), except that the page table entries of the parent process are not copied. Instead, the child executes as the sole thread in the parent's address space, and the parent is blocked until the child either calls exec() or exits. The child is not allowed to write to the address space. This was a welcome optimization in the old days of 3BSD when the call was introduced because at the time copy-on-write pages were not used to implement fork(). Today, with copy-on-write and child-runs-first semantics, the only benefit to vfork() is not copying the parent page tables entries. If Linux one day gains copy-on-write page table entries there will no longer be any benefit[10]. Because the semantics of vfork() are tricky (what, for example, happens if the exec() fails?) it would be nice if vfork() died a slow painful death. It is entirely possible to implement vfork() as a normal fork(); in fact, this is what Linux did until 2.2.


[10] In fact, there are currently patches to add this functionality to Linux. In time, this feature will most likely find its way into the mainline Linux kernel.


The vfork() system call is implemented via a special flag to the clone() system call:


·         In copy_process(), the task_struct member vfork_done is set to NULL.


·         In do_fork(), if the special flag was given, vfork_done is pointed at a specific address.


·         After the child is first run, the parent, instead of returning, waits for the child to signal it through the vfork_done pointer.


·         In the mm_release() function, which is used when a task exits a memory address space, vfork_done is checked to see whether it is NULL. If it is not, the parent is signaled.


·         Back in do_fork(), the parent wakes up and returns.


If this all goes as planned, the child is now executing in a new address space and the parent is again executing in its original address space. The overhead is lower, but the design is not pretty.

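For completeness, a user-space sketch of the only sensible vfork() pattern: the child immediately execs or exits, while the parent remains blocked until it does. /bin/true is an arbitrary example program.

#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        pid_t pid = vfork();    /* parent is blocked until the child execs or exits */

        if (pid == 0) {
                /* child: borrows the parent's address space, so it must only
                 * exec or _exit, never return or modify data */
                execl("/bin/true", "true", (char *) NULL);
                _exit(1);       /* reached only if the exec failed */
        } else if (pid > 0) {
                waitpid(pid, NULL, 0);
        }
        return 0;
}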

The Linux Implementation of Threads


Threads are a popular modern programming abstraction. They provide multiple threads of execution within the same program in a shared memory address space. They can also share open files and other resources. Threads allow for concurrent programming and, on multiple processor systems, true parallelism.


Linux has a unique implementation of threads. To the Linux kernel, there is no concept of a thread. Linux implements all threads as standard processes. The Linux kernel does not provide any special scheduling semantics or data structures to represent threads. Instead, a thread is merely a process that shares certain resources with other processes. Each thread has a unique task_struct and appears to the kernel as a normal process (which just happens to share resources, such as an address space, with other processes).


This approach to threads contrasts greatly with operating systems such as Microsoft Windows or Sun Solaris, which have explicit kernel support for threads (and sometimes call threads lightweight processes). The name "lightweight process" sums up the difference in philosophies between Linux and other systems. To these other operating systems, threads are an abstraction to provide a lighter, quicker execution unit than the heavy process. To Linux, threads are simply a manner of sharing resources between processes (which are already quite lightweight)[11]. For example, assume you have a process that consists of four threads. On systems with explicit thread support, there might exist one process descriptor that in turn points to the four different threads. The process descriptor describes the shared resources, such as an address space or open files. The threads then describe the resources they alone possess. Conversely, in Linux, there are simply four processes and thus four normal task_struct structures. The four processes are set up to share certain resources.

[11] As an example, benchmark process creation time in Linux versus process (or even thread!) creation time in these other operating systems. The results are quite nice.

Threads are created like normal tasks, with the exception that the clone() system call is passed flags corresponding to specific resources to be shared:

clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);

 

The previous code results in behavior identical to a normal fork(), except that the address space, filesystem resources, file descriptors, and signal handlers are shared. In other words, the new task and its parent are what are popularly called threads.
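For illustration, the same flag combination can be exercised from user space through the C library's clone() wrapper. SIGCHLD is added here so the parent can reap the child with waitpid(); stack handling and error checking are kept minimal, and thread_fn is just a placeholder function:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)

static int thread_fn(void *arg)
{
        /* runs in the same address space as its creator */
        printf("child task sees shared value %d\n", *(int *) arg);
        return 0;
}

int main(void)
{
        int shared = 42;
        char *stack = malloc(STACK_SIZE);

        if (!stack)
                return 1;
        /* share the address space, filesystem info, open files, and signal
         * handlers: the new task is, in the popular sense, a thread */
        pid_t pid = clone(thread_fn, stack + STACK_SIZE,
                          CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                          &shared);
        if (pid > 0)
                waitpid(pid, NULL, 0);
        free(stack);
        return 0;
}

In practice, applications use pthread_create(), which the threading library implements on top of clone() with essentially these flags plus CLONE_THREAD and the TID bookkeeping flags.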

In contrast, a normal fork() can be implemented as

clone(SIGCHLD, 0);

 

And vfork() is implemented as

clone(CLONE_VFORK | CLONE_VM | SIGCHLD, 0);

 

The flags provided to clone() help specify the behavior of the new process and detail what resources the parent and child will share. Table 3.1 lists the clone flags, which are defined in <linux/sched.h>, and their effect.

Table 3.1. clone() Flags

Flag                     Meaning
CLONE_FILES              Parent and child share open files.
CLONE_FS                 Parent and child share filesystem information.
CLONE_IDLETASK           Set PID to zero (used only by the idle tasks).
CLONE_NEWNS              Create a new namespace for the child.
CLONE_PARENT             Child is to have same parent as its parent.
CLONE_PTRACE             Continue tracing child.
CLONE_SETTID             Write the TID back to user-space.
CLONE_SETTLS             Create a new TLS (thread-local storage) for the child.
CLONE_SIGHAND            Parent and child share signal handlers and blocked signals.
CLONE_SYSVSEM            Parent and child share System V SEM_UNDO semantics.
CLONE_THREAD             Parent and child are in the same thread group.
CLONE_VFORK              vfork() was used and the parent will sleep until the child wakes it.
CLONE_UNTRACED           Do not let the tracing process force CLONE_PTRACE on the child.
CLONE_STOP               Start process in the TASK_STOPPED state.
CLONE_CHILD_CLEARTID     Clear the TID in the child.
CLONE_CHILD_SETTID       Set the TID in the child.
CLONE_PARENT_SETTID      Set the TID in the parent.
CLONE_VM                 Parent and child share address space.

 

Kernel Threads

It is often useful for the kernel to perform some operations in the background. The kernel accomplishes this via kernel threads, standard processes that exist solely in kernel-space. The significant difference between kernel threads and normal processes is that kernel threads do not have an address space (in fact, their mm pointer is NULL). They operate only in kernel-space and do not context switch into user-space. Kernel threads are, however, schedulable and preemptable as normal processes.

Linux delegates several tasks to kernel threads, most notably the pdflush task and the ksoftirqd task. These threads are created on system boot by other kernel threads. Indeed, a kernel thread can be created only by another kernel thread. The interface for spawning a new kernel thread from an existing one is

int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)

 

The new task is created via the usual clone() system call with the specified flags argument. On return, the parent kernel thread exits with a pointer to the child's task_struct. The child executes the function specified by fn with the given argument arg. A special clone flag, CLONE_KERNEL, specifies the usual flags for kernel threads: CLONE_FS, CLONE_FILES, and CLONE_SIGHAND. Most kernel threads pass this for their flags parameter.

Typically, a kernel thread continues executing its initial function forever (or at least until the system reboots, but with Linux you never know). The initial function usually implements a loop in which the kernel thread wakes up as needed, performs its duties, and then returns to sleep.
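A hedged sketch of that pattern: a hypothetical worker spawned with kernel_thread(), where my_work_pending and do_the_work() stand in for whatever the thread actually services:

static int my_kernel_thread(void *unused)
{
        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (!my_work_pending)           /* placeholder condition */
                        schedule();             /* sleep until something wakes us */
                __set_current_state(TASK_RUNNING);
                do_the_work();                  /* placeholder for the thread's duty */
        }
        return 0;
}

/* spawned from another kernel thread, for example during boot */
kernel_thread(my_kernel_thread, NULL, CLONE_KERNEL);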

We will discuss specific kernel threads in more detail in later chapters.

Process Termination

It is sad, but eventually processes must die. When a process terminates, the kernel releases the resources owned by the process and notifies the child's parent of its unfortunate demise.

Typically, process destruction occurs when the process calls the exit() system call, either explicitly when it is ready to terminate or implicitly on return from the main subroutine of any program (that is, the C compiler places a call to exit() after main() returns). A process can also terminate involuntarily. This occurs when the process receives a signal or exception it cannot handle or ignore. Regardless of how a process terminates, the bulk of the work is handled by do_exit(), which completes a number of chores:

·         First, it sets the PF_EXITING flag in the flags member of the task_struct.

·         Second, it calls del_timer_sync() to remove any kernel timers. Upon return, it is guaranteed that no timer is queued and that no timer handler is running.

·         Next, if BSD process accounting is enabled, do_exit() calls acct_process() to write out accounting information.

·         Now it calls __exit_mm() to release the mm_struct held by this process. If no other process is using this address space (in other words, if it is not shared), then deallocate it.

·         Next, it calls exit_sem(). If the process is queued waiting for an IPC semaphore, it is dequeued here.

·         It then calls __exit_files(), __exit_fs(), exit_namespace(), and exit_sighand() to decrement the usage count of objects related to file descriptors, filesystem data, the process namespace, and signal handlers, respectively. If any usage counts reach zero, the object is no longer in use by any process and it is removed.

·         Subsequently, it sets the task's exit code, stored in the exit_code member of the task_struct, to the code provided by exit() or whatever kernel mechanism forced the termination. The exit code is stored here for optional retrieval by the parent.

·         It then calls exit_notify() to send signals to the task's parent, reparents any of the task's children to another thread in their thread group or the init process, and sets the task's state to TASK_ZOMBIE.

·         Finally, do_exit() calls schedule() to switch to a new process (see Chapter 4). Because TASK_ZOMBIE tasks are never scheduled, this is the last code the task will ever execute.

The code for do_exit() is defined in kernel/exit.c.

At this point, all objects associated with the task (assuming the task was the sole user) are freed. The task is not runnable (and in fact no longer has an address space in which to run) and is in the TASK_ZOMBIE state. The only memory it occupies is its kernel stack, the thread_info structure, and the task_struct structure. The task exists solely to provide information to its parent. After the parent retrieves the information, or notifies the kernel that it is uninterested, the remaining memory held by the process is freed and returned to the system for use.

Removal of the Process Descriptor

After do_exit() completes, the process descriptor for the terminated process still exists but the process is a zombie and is unable to run. As discussed, this allows the system to obtain information about a child process after it has terminated. Consequently, the acts of cleaning up after a process and removing its process descriptor are separate. After the parent has obtained information on its terminated child, or signified to the kernel that it does not care, the child's task_struct is deallocated.

The wait() family of functions are implemented via a single (and complicated) system call, wait4(). The standard behavior is to suspend execution of the calling task until one of its children exits, at which time the function returns with the PID of the exited child. Additionally, a pointer is provided to the function that on return holds the exit code of the terminated child.

When it is time to finally deallocate the process descriptor, release_task() is invoked. It does the following:

·         First, it calls free_uid() to decrement the usage count of the process's user. Linux keeps a per-user cache of information related to how many processes and files a user has opened. If the usage count reaches zero, the user has no more open processes or files and the cache is destroyed.

·         Second, release_task() calls unhash_process() to remove the process from the pidhash and remove the process from the task list.

·         Next, if the task was ptraced, release_task() reparents the task to its original parent and removes it from the ptrace list.

·         Ultimately, release_task() calls put_task_struct() to free the pages containing the process's kernel stack and thread_info structure and deallocate the slab cache containing the task_struct.

At this point, the process descriptor and all resources belonging solely to the process have been freed.

The Dilemma of the Parentless Task

If a parent exits before its children, some mechanism must exist to reparent the child tasks to a new process, or else parentless terminated processes would forever remain zombies, wasting system memory. The solution, hinted upon previously, is to reparent a task's children on exit to either another process in the current thread group or, if that fails, the init process. In do_exit(), notify_parent() is invoked, which calls forget_original_parent() to perform the reparenting:

struct task_struct *p, *reaper = father;
struct list_head *list;
 
if (father->exit_signal != -1)
        reaper = prev_thread(reaper);
else
        reaper = child_reaper;
 
if (reaper == father)
        reaper = child_reaper;

 

This code sets reaper to another task in the process's thread group. If there is not another task in the thread group, it sets reaper to child_reaper, which is the init process. Now that a suitable new parent for the children is found, each child needs to be located and reparented to reaper:

list_for_each(list, &father->children) {
        p = list_entry(list, struct task_struct, sibling);
        reparent_thread(p, reaper, child_reaper);
}
 
list_for_each(list, &father->ptrace_children) {
        p = list_entry(list, struct task_struct, ptrace_list);
        reparent_thread(p, reaper, child_reaper);
}

 

This code iterates over two lists: the child list and the ptraced child list, reparenting each child. The rationale behind having both lists is interesting; it is a new feature in the 2.6 kernel. When a task is ptraced, it is temporarily reparented to the debugging process. When the task's parent exits, however, it must be reparented along with its other siblings. In previous kernels, this resulted in a loop over every process in the system looking for children. The solution, as noted previously, is simply to keep a separate list of a process's children that are being ptraced, reducing the search for one's children from every process to just two relatively small lists.

With the process successfully reparented, there is no risk of stray zombie processes. The init process routinely calls wait() on its children, cleaning up any zombies assigned to it.

Process Wrap Up

In this chapter, we looked at the famed operating system abstraction of the process. We discussed the generalities of the process, why it is important, and the relationship between processes and threads. We then discussed how Linux stores and represents processes (with task_struct and thread_info), how processes are created (via clone() and fork()), how new executable images are loaded into address spaces (via the exec() family of system calls), the hierarchy of processes, how parents glean information about their deceased children (via the wait() family of system calls), and how processes ultimately die (forcefully or intentionally via exit()).

The process is a fundamental and crucial abstraction, at the heart of every modern operating system, and ultimately the reason we have operating systems altogether (to run programs).

The next chapter discusses process scheduling, which is the delicate and interesting manner in which the kernel decides which processes to run, at what time, and in what order.

 
