Analysis of the Linux Boot Process (repost) - renyuan000 - ChinaUnix blog

Category: LINUX

2011-12-19 15:39:39

http://bbs.chinaunix.net/thread-2087843-1-1.html

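I find the Linux boot process quite complicated. I read a lot of material and still did not understand it well, so I went through some of the code under linux/arch/i386 with that material at hand. Here are some of my notes, in the hope that they help.
Most of this is my own understanding, so parts of it may be wrong; corrections are welcome.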

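1. First, a look at /boot/vmlinuz-2.6.23
This is the kernel file. At boot, the boot loader loads it into memory and then executes it. Everyone knows that much, but the actual process is fairly involved. Linux works with many boot loaders and runs on many architectures; the code I read is for i386, so I only talk about grub here. There are several tricky points: how grub can read files, where it loads the kernel, and how it starts executing it. As for the first question: from data stored in grub's first 512-byte block (its boot sector), grub can tell which disk blocks hold its second-stage code (512 bytes are obviously not enough to understand a filesystem). That data is recorded when grub is installed. The boot sector then runs the second stage, which can understand filesystems.
From here on, assume grub can already read the filesystem.
Once grub has vmlinuz (I will just call it that from now on), how does it handle it? That requires a protocol (for details see linux-2.6.23/Documentation/i386/boot.txt). First, what vmlinuz consists of:
Part 1: the first 512 bytes
Part 2: a piece of code spanning a modest number of 512-byte sectors (its size is discussed below)
Part 3: the protected-mode kernel code
The first 512 bytes are the usual boot sector, corresponding to the "Ancient Age" in ULK3 (though it is a bit special, because it is no longer actually used as a boot sector, as we will see). It used to live in arch/i386/boot/bootsect.S, but the current code shows that it has been merged, together with part of the second piece, into arch/i386/boot/header.S.
The second part is the real-mode setup code, corresponding to the "Middle Ages" in ULK3. As noted, in current versions part of setup is merged with the first part into header.S, while the rest of the setup code comes from other source files in the boot directory (see boot/Makefile). The size of this code is known once setup is linked, and it is then written into a single byte at offset 0x01F1 of the first 512 bytes (which is also offset 0x01F1 of the whole vmlinuz). According to boot.txt this byte, setup_sects, is the number of 512-byte sectors of setup code, not counting the boot sector. For example:
xuchm@debian:~/doc$ od -j 0x01F1 -N1 /boot/vmlinuz-2.6.23 -D
0000761       21
0000762
So the setup code occupies 21*512 == 10752 bytes (10.5K), and the boot sector plus setup together take (1+21)*512 == 11264 bytes.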
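As a rough illustration of how a loader could use this byte, the C sketch below computes where the protected-mode code starts inside the image. It relies on two rules from boot.txt: a setup_sects value of 0 means 4, and the protected-mode code starts at file offset (setup_sects + 1) * 512. The function name and the constants are made up for this example and are not taken from grub or the kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define SETUP_SECTS_OFFSET 0x1F1u  /* position of setup_sects in the image */
#define SECTOR_SIZE        512u

/*
 * Hypothetical helper: byte offset of the protected-mode code in a
 * kernel image, or 0 if the buffer is too short to hold the field.
 */
uint32_t protected_mode_offset(const uint8_t *image, size_t len)
{
    uint32_t setup_sects;

    if (image == NULL || len <= SETUP_SECTS_OFFSET)
        return 0;
    setup_sects = image[SETUP_SECTS_OFFSET];
    if (setup_sects == 0)          /* boot.txt: 0 means the default of 4 */
        setup_sects = 4;
    return (setup_sects + 1u) * SECTOR_SIZE;   /* +1 for the boot sector */
}
```

With the value 21 read by od above, this gives 22 * 512 = 11264, i.e. the first byte after the boot sector and the setup code.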
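The third part consists of all the rest of the kernel code. It executes after the switch to protected mode, so in contrast to setup it is called the protected-mode code.
Note that I am describing an uncompressed kernel here, so the decompression step described in ULK3 is left out.
With the layout of the kernel file in mind, let us look at how it is loaded.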

First, an excerpt from Documentation/i386/boot.txt:
For a modern bzImage kernel with boot protocol version >= 2.02, a
memory layout like the following is suggested:

        ~                        ~
        |  Protected-mode kernel |      # the protected-mode code described above; grub puts it here
100000  +------------------------+
        |  I/O memory hole       |
0A0000  +------------------------+
        |  Reserved for BIOS     |      Leave as much as possible unused
        ~                        ~
        |  Command line          |      (Can also be below the X+10000 mark)
X+10000 +------------------------+
        |  Stack/heap            |      For use by the kernel real-mode code.
X+08000 +------------------------+
        |  Kernel setup          |      The kernel real-mode code.
        |  Kernel boot sector    |      The kernel legacy boot sector.
                                        # the two entries above are the code and data of the first two parts of vmlinuz
X       +------------------------+
        |  Boot loader           |      <- Boot sector entry point 0000:7C00
001000  +------------------------+
        |  Reserved for MBR/BIOS |
000800  +------------------------+
        |  Typically used by MBR |
000600  +------------------------+
        |  BIOS use only         |
000000  +------------------------+

... where the address X is as low as the design of the boot loader
permits.

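As you can see, the first two parts of vmlinuz sit together, and their location is not fixed; it depends on the size of grub itself. (I call these two parts the real-mode code, although the excerpt above uses that term only for the setup part.) The third part goes at 0x100000; whether it really is loaded there is indicated by the byte at offset 0x0211, which comes up again below.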

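2. The assembly code in header.S
As described above, the first part and (some of) the second part live in boot/header.S, so we read them together. For completeness I have pasted the whole file, which takes up a lot of space; my annotations are interleaved with the code.
Starting again from loading: as noted, the protected-mode code of vmlinuz is loaded at 0x100000. The real-mode code (once more, this means both the bootsect and setup parts) is not required to be loaded at a fixed address, so the entries shown above as
Kernel setup    | The kernel real-mode code.
Kernel boot sector
are placed depending on the size of grub.
So let us assume they are loaded at the logical address __LOAD_DS__:0000.
Now for the code: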
/*
* header.S
*
* Copyright (C) 1991, 1992 Linus Torvalds
*
* Based on bootsect.S and setup.S
* modified by more people than can be counted
*
* Rewritten as a common file by H. Peter Anvin (Apr 2007)
*
* BIG FAT NOTE: We're in real mode using 64k segments.  Therefore segment
* addresses must be multiplied by 16 to obtain their respective linear
* addresses. To avoid confusion, linear addresses are written using leading
* hex while segment addresses are written as segment:offset.
*
*/

#include
#include
#include
#include
#include
#include
#include "boot.h"

SETUPSECTS = 4 /* default nr of setup-sectors */
BOOTSEG = 0x07C0 /* original address of boot-sector */
SYSSEG = DEF_SYSSEG /* system loaded at 0x10000 (65536) */
SYSSIZE = DEF_SYSSIZE /* system size: # of 16-byte clicks */
/* to be loaded */
ROOT_DEV = 0 /* ROOT_DEV is now written by "build" */
SWAP_DEV = 0 /* SWAP_DEV is now written by "build" */

#ifndef SVGA_MODE
#define SVGA_MODE ASK_VGA
#endif

#ifndef RAMDISK
#define RAMDISK 0
#endif

#ifndef ROOT_RDONLY
#define ROOT_RDONLY 2
#endif

.code16
.section ".bstext", "ax" # note the name of this section

.global bootsect_start
bootsect_start:

# The bootsect code starts here; this is the source of the first 512 bytes of vmlinuz
# Normalize the start address
ljmp $BOOTSEG, $start2

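# 0x07C0:0000. If execution starts here, the image was loaded directly by the BIOS. That is not supported, because Linux now requires a boot loader; this is what makes this bootsect special: it is never meant to be executed. If the BIOS does run it as a boot sector, it just asks the user to reboot. You can try this in vmware:
dd if=/boot/vmlinuz-2.6.23 of=vm.img bs=512 count=1
Then boot with vm.img as the floppy image, and you will see the message defined below.
All this code does is print the message and wait for a reboot, so I will not go into it.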

start2:
movw %cs, %ax
movw %ax, %ds
movw %ax, %es
movw %ax, %ss
xorw %sp, %sp
sti
cld

movw $bugger_off_msg, %si

msg_loop:
lodsb
andb %al, %al
jz bs_die
movb $0xe, %ah
movw $7, %bx
int $0x10
jmp msg_loop

bs_die:
# Allow the user to press a key, then reboot
xorw %ax, %ax
int $0x16
int $0x19

# int 0x19 should never return.  In case it does anyway,
# invoke the BIOS reset code...
ljmp $0xf000,$0xfff0

.section ".bsdata", "a" # note the section
bugger_off_msg:
.ascii "Direct booting from floppy is no longer supported.\r\n"
.ascii "Please use a boot loader program instead.\r\n"
.ascii "\n"
.ascii "Remove disk and press any key to reboot . . .\r\n"
.byte 0

# Kernel attributes; used by setup.  This is part 1 of the
# header, from the old boot sector.
# This is also part of the first 512 bytes of vmlinuz, but it is data

.section ".header", "a" # note the section name
.globl hdr
hdr:
setup_sects: .byte SETUPSECTS
root_flags: .word ROOT_RDONLY
syssize: .long SYSSIZE
ram_size: .word RAMDISK
vid_mode: .word SVGA_MODE
root_dev: .word ROOT_DEV
boot_flag: .word 0xAA55 # look familiar?

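# The above defines three sections, .bstext, .bsdata and .header, which together make up the first 512 bytes of vmlinuz described earlier. Next comes the "Middle Ages" code, which is also where the *first instruction* runs when the kernel takes over from the boot loader in the normal case.

As mentioned, grub loads the real-mode code at __LOAD_DS__:0000. How does grub start executing it?

jmp_far(__LOAD_DS__+0x20, 0); /* Run the kernel */
Adding 0x20 to the segment moves forward 0x200 bytes (in real mode, linear address = (seg << 4) + offset), so this jumps to the real-mode setup code of vmlinuz, skipping the 512-byte bootsect.
In grub this is really just one instruction, an ljmp.
Execution has now reached the setup code, at the second 512-byte block of vmlinuz, i.e. _start.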
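The segment arithmetic can be checked with a few lines of C. This is only a sketch of the real-mode address formula; the function name is invented here, and the segment 0x9000 used in the note below is an arbitrary stand-in for __LOAD_DS__.

```c
#include <stdint.h>

/* Real-mode address translation: linear = (segment << 4) + offset. */
uint32_t real_mode_linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}
```

For instance, real_mode_linear(0x9020, 0) and real_mode_linear(0x9000, 0x200) name the same byte, which is why jumping to segment __LOAD_DS__+0x20 with offset 0 lands exactly 512 bytes into the loaded image.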

# offset 512, entry point

.globl _start
_start:
# Explicitly enter this as bytes, or the assembler
# tries to generate a 3-byte jump here, which causes
# everything else to push off to the wrong offset.
.byte 0xeb # short (2-byte) jump
.byte start_of_setup-1f
# The first instruction: a short jump over some data. That data comes up again later, so for now continue at start_of_setup:
1:

# Part 2 of the header, from the old setup.S

.ascii "HdrS" # header signature
.word 0x0206 # header version number (>= 0x0105)
# or else old loadlin-1.5 will fail)
.globl realmode_swtch
realmode_swtch: .word 0, 0 # default_switch, SETUPSEG
start_sys_seg: .word SYSSEG
.word kernel_version-512 # pointing to kernel version string
# above section of header is compatible
# with loadlin-1.5 (header v1.5). Don't
# change it.

type_of_loader: .byte 0 # = 0, old one (LILO, Loadlin,
#    Bootlin, SYSLX, bootsect...)
# See Documentation/i386/boot.txt for
# assigned ids

# flags, unused bits must be zero (RFU) bit within loadflags
loadflags:
LOADED_HIGH = 1 # If set, the kernel is loaded high
CAN_USE_HEAP = 0x80 # If set, the loader also has set
# heap_end_ptr to tell how much
# space behind setup.S can be used for
# heap purposes.
# Only the loader knows what is free
#ifndef __BIG_KERNEL__
.byte 0
#else
.byte LOADED_HIGH
#endif

setup_move_size: .word  0x8000 # size to move, when setup is not
# loaded at 0x90000. We will move setup
# to 0x90000 then just before jumping
# into the kernel. However, only the
# loader knows how much data behind
# us also needs to be loaded.

code32_start: # here loaders can put a different
# start address for 32-bit code.
#ifndef __BIG_KERNEL__
.long 0x1000 # 0x1000 = default for zImage
#else
.long 0x100000 # 0x100000 = default for big kernel
# tells the loader to place the protected-mode code at 0x100000 (starting at 1M)
#endif

ramdisk_image: .long 0 # address of loaded ramdisk image
# Here the loader puts the 32-bit
# address where it loaded the image.
# This only will be read by the kernel.

ramdisk_size: .long 0 # its size in bytes

bootsect_kludge:
.long 0 # obsolete

heap_end_ptr: .word _end+1024 # (Header version 0x0201 or later)
# space from here (exclusive) down to
# end of setup code can be used by setup
# for local heap purposes.

pad1: .word 0
cmd_line_ptr: .long 0 # (Header version 0x0202 or later)
# If nonzero, a 32-bit pointer
# to the kernel command line.
# The command line should be
# located between the start of
# setup and the end of low
# memory (0xa0000), or it may
# get overwritten before it
# gets read.  If this field is
# used, there is no longer
# anything magical about the
# 0x90000 segment; the setup
# can be located anywhere in
# low memory 0x10000 or higher.

ramdisk_max: .long (-__PAGE_OFFSET-(512 << 20)-1) & 0x7fffffff
# (Header version 0x0203 or later)
# The highest safe address for
# the contents of an initrd

kernel_alignment:  .long CONFIG_PHYSICAL_ALIGN #physical addr alignment
#required for protected mode
#kernel
#ifdef CONFIG_RELOCATABLE
relocatable_kernel: .byte 1
#else
relocatable_kernel: .byte 0
#endif
pad2: .byte 0
pad3: .word 0

cmdline_size: .long COMMAND_LINE_SIZE-1    #length of the command line,
                                             #added with boot protocol
                                             #version 2.06

# End of setup header #####################################################

.section ".inittext", "ax"
# execution arrives here ************
start_of_setup:
#ifdef SAFE_RESET_DISK_CONTROLLER
# Reset the disk controller.
movw $0x0000, %ax # Reset disk controller
movb $0x80, %dl # All disks
int $0x13
#endif

# We will have entered with %cs = %ds+0x20, normalize %cs so
# it is on par with the other segments.
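As described above, grub arrives here via ljmp with the data segment set to __LOAD_DS__, so after the jump cs holds __LOAD_DS__ + 0x20. The code below resets cs to __LOAD_DS__. (Otherwise, because setup is linked to sit after the first 512 bytes, instruction offsets would be wrong and cs:ip would point 0x200 bytes too far.)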
pushw %ds
pushw $setup2
lretw

setup2:
# Force %es = %ds
movw %ds, %ax
movw %ax, %es
cld

# Stack paranoia: align the stack and make sure it is good
# for both 16- and 32-bit references.  In particular, if we
# were meant to have been using the full 16-bit segment, the
# caller might have set %sp to zero, which breaks %esp-based
# references.
andw $~3, %sp # dword align (might as well...)
jnz 1f
movw $0xfffc, %sp # Make sure we're not zero
1: movzwl %sp, %esp # Clear upper half of %esp
sti

# Check signature at end of setup
cmpl $0x5a5aaa55, setup_sig
jne setup_bad
The instructions above set up the stack; for an intact setup image the cmpl comparison always succeeds.
# Zero the bss
movw $__bss_start, %di
movw $_end+3, %cx
xorl %eax, %eax
subw %di, %cx
shrw $2, %cx
rep; stosl
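The code above zeroes the bss section of setup. Later we will see symbols with similar names in the protected-mode kernel code; they are not the same symbols, but they serve the same purpose, hence the same names.
# Jump to C code (should not return)
calll main # on to the C code
The code is in boot/main.c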

# Setup corrupt somehow...
setup_bad:
movl $setup_corrupt, %eax
calll puts
# Fall through...

.globl die
.type die, @function
die:
hlt
jmp die

.size die, .-die

.section ".initdata", "a"
setup_corrupt:
.byte 7
.string "No setup signature found...\n"
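The count used by the bss-clearing loop in the listing above (the "Zero the bss" step) can be reproduced in C. The sketch mirrors the register arithmetic cx = (_end + 3 - __bss_start) >> 2, which rounds the byte length up to whole 32-bit stores; the function name is hypothetical and any addresses passed in are just sample values.

```c
#include <stdint.h>

/*
 * Number of 32-bit stores needed to clear the bytes from bss_start up
 * to (not including) end, rounded up, using 16-bit register arithmetic
 * as in header.S.
 */
uint16_t bss_dword_count(uint16_t bss_start, uint16_t end)
{
    uint16_t bytes_rounded = (uint16_t)(end + 3u - bss_start);

    return (uint16_t)(bytes_rounded >> 2);
}
```

A 10-byte bss therefore takes 3 stores (12 bytes are cleared), and a 16-byte bss takes exactly 4.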
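3. The C code in boot/main.c
header.S ends by calling main, which is void main() in boot/main.c.
I have not read most of this function; if you want to, see the section on the setup function in Appendix A of ULK3. Look at the last line of main:
/* Do the last things and invoke protected mode */
go_to_protected_mode();
That function in turn calls
protected_mode_jump(boot_params.hdr.code32_start,(u32)&boot_params + (ds() << 4));
which takes two arguments. The first is the address of the first protected-mode instruction, which as mentioned is 0x100000. The second is the parameter block passed to the kernel; since we are switching to protected mode, it must be given as a linear address rather than a segment offset, and ds() simply returns the value of the ds register.
Here is the function; the code is in pmjump.S: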


/*
* void protected_mode_jump(u32 entrypoint, u32 bootparams);
*/

First, note that this function takes its arguments in registers; see CFLAGS in boot/Makefile.
protected_mode_jump:
xorl %ebx, %ebx # Flag to indicate this is a boot
movl %edx, %esi # Pointer to boot_params table; boot_params is kept in %esi
movl %eax, 2f # Patch ljmpl instruction; see the code at label 2 below
jmp 1f # Short jump to flush instruction q.

1:
movw $__BOOT_DS, %cx

movl %cr0, %edx
orb $1, %dl # Protected mode (PE) bit
movl %edx, %cr0

movw %cx, %ds
movw %cx, %es
movw %cx, %fs
movw %cx, %gs
movw %cx, %ss

# Jump to the 32-bit entrypoint
.byte 0x66, 0xea # ljmpl opcode
2: .long 0 # offset; by now patched to 0x100000. We are in protected mode here, so __BOOT_CS is a selector into the gdt
.word __BOOT_CS # segment

.size protected_mode_jump, .-protected_mode_jump
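The hand-assembled jump at label 2 can be pictured as a byte sequence. The C sketch below builds that sequence for a given entry point and selector; it only illustrates the encoding (prefix 0x66, opcode 0xEA, 32-bit offset, 16-bit selector, all little-endian). The function name is made up, and the selector value 0x10 mentioned in the note is an assumption for __BOOT_CS, not something shown in the listing.

```c
#include <stdint.h>

/*
 * Encode "ljmpl $selector, $entry" as emitted from 16-bit code:
 * 0x66 (operand-size prefix), 0xEA (far jump opcode), then a
 * little-endian 32-bit offset followed by a 16-bit selector.
 */
void encode_ljmpl(uint32_t entry, uint16_t selector, uint8_t out[8])
{
    out[0] = 0x66;
    out[1] = 0xEA;
    out[2] = (uint8_t)(entry & 0xFF);
    out[3] = (uint8_t)((entry >> 8) & 0xFF);
    out[4] = (uint8_t)((entry >> 16) & 0xFF);
    out[5] = (uint8_t)((entry >> 24) & 0xFF);
    out[6] = (uint8_t)(selector & 0xFF);
    out[7] = (uint8_t)(selector >> 8);
}
```

Calling encode_ljmpl(0x100000, 0x10, buf) yields the bytes 66 EA 00 00 10 00 10 00. The "movl %eax, 2f" at the top of protected_mode_jump overwrites the four offset bytes in place, which is what "Patch ljmpl instruction" refers to.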

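By the end of main, interrupts have been disabled, an initial gdt and idt have been set up, and the CPU has entered protected mode.
Execution has moved to linear address 0x100000, the protected-mode code mentioned at the start of this article. That code is in arch/i386/kernel/head.S, so we enter ULK3's startup_32 function, the "Renaissance".
(The segment base is 0, see setup_gdt() called from go_to_protected_mode(), so the linear address equals the offset; paging is not enabled yet, so the linear address is also the physical address, and physical 1M is exactly where the protected-mode code sits.) I do not want to say too much about protected mode, so I am not listing that code, and I am skipping a lot of related background too: reading the documentation is easy, explaining it is hard, and almost any document explains it more clearly than I can.

OK, we are now in protected mode.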

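4. arch/i386/kernel/head.S

As said above, we are now in the protected-mode code, and the first code to run is head.S. Since head.S is fairly long (over 500 lines), I will not list all of it as I did above; code that does not affect an overall understanding of head.S is removed,
as are some comments that do not help much. Whatever is listed appears in the same order as in the source, for easy reference, and omissions are marked.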

.text
#include a lot of .h
...

/*
* References to members of the new_cpu_data structure.
*/

#define X86 new_cpu_data+CPUINFO_x86
#define X86_VENDOR new_cpu_data+CPUINFO_x86_vendor
#define X86_MODEL new_cpu_data+CPUINFO_x86_model
#define X86_MASK new_cpu_data+CPUINFO_x86_mask
#define X86_HARD_MATH new_cpu_data+CPUINFO_hard_math
#define X86_CPUID new_cpu_data+CPUINFO_cpuid_level
#define X86_CAPABILITY new_cpu_data+CPUINFO_x86_capability
#define X86_VENDOR_ID new_cpu_data+CPUINFO_x86_vendor_id

... (some comments describing the macros below)
LOW_PAGES = 1<<(32-PAGE_SHIFT_asm)

#if PTRS_PER_PMD > 1
PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PMD) + PTRS_PER_PGD
#else
PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PGD)
#endif
BOOTBITMAP_SIZE = LOW_PAGES / 8
ALLOCATOR_SLOP = 4

INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE + (PAGE_TABLE_SIZE + ALLOCATOR_SLOP)*PAGE_SIZE_asm

/*
* 32-bit kernel entrypoint; only used by the boot CPU.  On entry,
* %esi points to the real-mode code as a 32-bit pointer.
* CS and DS must be 4 GB flat segments, but we don't depend on
* any particular GDT layout, because we load our own as soon as we
* can.
*/
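# setup_gdt in setup exists only for the ljmp after entering protected mode
.section .text.head,"ax",@progbits # the code goes into the .text.head section
ENTRY(startup_32) # the first protected-mode instruction starts here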

/*
* Set segments to known values.
*/
cld
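lgdt boot_gdt_descr - __PAGE_OFFSET # 3G, 0xC0000000, as everyone knows

Several places below use - __PAGE_OFFSET. To reference a variable we need its physical address, and at this point the linear address is the physical address (paging is off). But symbol values are the offset within the protected-mode image + 0xC0000000 + 0x100000 (the kernel will eventually run with paging, so it is linked relative to that base; more on this later). Without - __PAGE_OFFSET, the value of boot_gdt_descr above would be 0xC0100000+n, where n is a fairly small number: the offset of boot_gdt_descr from the start of the protected-mode code in vmlinuz. After the subtraction, boot_gdt_descr - __PAGE_OFFSET is 0x100000+n, exactly the physical address of boot_gdt_descr.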
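The subtraction is easy to check in C. This sketch assumes the default 3G/1G split, i.e. __PAGE_OFFSET == 0xC0000000; the function name and the sample offset 0x40 in the note are invented for illustration.

```c
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000u  /* assumed value of __PAGE_OFFSET */

/* Physical address of a kernel symbol while paging is still off. */
uint32_t early_phys_addr(uint32_t link_addr)
{
    return link_addr - PAGE_OFFSET;
}
```

A symbol linked at 0xC0100040 (n == 0x40) is thus found at physical 0x00100040, inside the image that grub placed at 1M.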
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
movl %eax,%fs
movl %eax,%gs

/*
* Clear BSS first so that there are no surprises...
* No need to cld as DF is already clear from cld above...
*/
xorl %eax,%eax
movl $__bss_start - __PAGE_OFFSET,%edi # not to be confused with the symbol of the same name in setup
movl $__bss_stop - __PAGE_OFFSET,%ecx
subl %edi,%ecx
shrl $2,%ecx
rep ; stosl
/*
* Copy bootup parameters out of the way.
* Note: %esi still has the pointer to the real-mode data.
* With the kexec as boot loader, parameter segment might be loaded beyond
* kernel image and might not even be addressable by early boot page tables.
* (kexec on panic case). Hence copy out the parameters before initializing
* page tables.
*/
# as we saw, protected_mode_jump put the address of setup's boot_params in %esi
# make a copy; again, this boot_params variable is not the same one as in setup
movl $(boot_params - __PAGE_OFFSET),%edi
movl $(PARAM_SIZE/4),%ecx
cld
rep
movsl
movl boot_params - __PAGE_OFFSET + NEW_CL_POINTER,%esi
andl %esi,%esi
jnz 2f # New command line protocol
cmpw $(OLD_CL_MAGIC),OLD_CL_MAGIC_ADDR
jne 1f
movzwl OLD_CL_OFFSET,%esi
addl $(OLD_CL_BASE_ADDR),%esi
2:
# handles command-line parameters such as init=/bin/bash, console=ttyS0
movl $(boot_command_line - __PAGE_OFFSET),%edi
movl $(COMMAND_LINE_SIZE/4),%ecx
rep
movsl
1:

/*
* Initialize page tables.  This creates a PDE and a set of page
* tables, which are located immediately beyond _end.  The variable
* init_pg_tables_end is set up to point to the first "safe" location.
* Mappings are created both at virtual address 0 (identity mapping)
* and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
*
* Warning: don't use %esi or the stack in this code.  However, %esp
* can be used as a GPR if you really need it...
*/
page_pde_offset = (__PAGE_OFFSET >> 20);

movl $(pg0 - __PAGE_OFFSET), %edi
movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
movl $0x007, %eax /* 0x007 = PRESENT+RW+USER */
10:
leal 0x007(%edi),%ecx /* Create PDE entry */
movl %ecx,(%edx) /* Store identity PDE entry */
movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
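# the two instructions above create the page directory entries for 0 and 3G; both point to the page table at pg0
addl $4,%edx # advance to the next page directory entry for the next pass through here; see the jb 10b below
movl $1024, %ecx # 1024 page table entries at a time, 1024*4K == 4M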
11:
stosl
addl $0x1000,%eax # for now, pages are 0x1000 == 4K
loop 11b
/* End condition: we must map up to and including INIT_MAP_BEYOND_END */
/* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */
leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp
cmpl %ebp,%eax
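jb 10b # make sure enough addresses are reachable once paging is on

The code above builds enough page tables and page table entries. It may be a little hard to follow if you are not familiar with x86 paging;
see the "Provisional kernel Page Tables" section in Chapter 2 of ULK3. It differs a bit, but is still very helpful.
movl %edi,(init_pg_tables_end - __PAGE_OFFSET) # ***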

xorl %ebx,%ebx /* This is the boot CPU (BSP) */
jmp 3f

# some multiprocessor code has been removed here

3:

/*
* Enable paging
*/
movl $swapper_pg_dir-__PAGE_OFFSET,%eax
movl %eax,%cr3 /* set the page table pointer.. */
movl %cr0,%eax
orl $0x80000000,%eax
movl %eax,%cr0 /* ..and set paging (PG) bit */
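ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */ # a *far jump*
From here on paging is enabled, so - __PAGE_OFFSET is no longer needed: the page directory entries for low addresses and for addresses from 0xC0000000 up (selected by the top 10 bits of the address) point to the same page tables.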
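The double mapping can be simulated in user space. The C sketch below is a toy model and not kernel code: it fills a page directory and two page tables in the same pattern as the loop at labels 10/11, then walks them for a given address. The table count, the names, and the choice to store a table index (rather than a physical address) in each directory entry are all simplifications made for this example.

```c
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000u
#define PTE_ATTR    0x007u        /* PRESENT + RW + USER */
#define N_TABLES    2             /* hypothetical: map the first 8 MiB */
#define UNMAPPED    0xFFFFFFFFu

static uint32_t pgd[1024];               /* stands in for swapper_pg_dir */
static uint32_t tables[N_TABLES][1024];  /* stand in for pg0 onwards */

/* Build the identity mapping and the mapping at PAGE_OFFSET. */
void build_boot_tables(void)
{
    uint32_t phys = 0;

    for (uint32_t t = 0; t < N_TABLES; t++) {
        /* simulation only: the entry holds a table index, not an address */
        uint32_t pde = (t << 12) | PTE_ATTR;

        pgd[t] = pde;                         /* virtual 0 upwards */
        pgd[(PAGE_OFFSET >> 22) + t] = pde;   /* virtual 3G upwards */
        for (uint32_t i = 0; i < 1024; i++, phys += 0x1000u)
            tables[t][i] = phys | PTE_ATTR;
    }
}

/* Two-level walk; returns UNMAPPED if an entry is not present. */
uint32_t translate(uint32_t vaddr)
{
    uint32_t pde = pgd[vaddr >> 22];
    uint32_t pte;

    if (!(pde & 1u))
        return UNMAPPED;
    pte = tables[pde >> 12][(vaddr >> 12) & 0x3FFu];
    if (!(pte & 1u))
        return UNMAPPED;
    return (pte & ~0xFFFu) | (vaddr & 0xFFFu);
}
```

After build_boot_tables(), both translate(0x00100000) and translate(0xC0100000) give 0x00100000. The low mapping keeps the code that is currently executing reachable at the moment paging is switched on, and the high mapping is what the kernel's link-time addresses need.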

1:
/* Set up the stack pointer */
lss stack_start,%esp

/*
* Initialize eflags.  Some BIOS's leave bits like NT set.  This would
* confuse the debugger if this code is traced.
* XXX - best to initialize before switching to protected mode.
*/
pushl $0
popfl
(a few lines of multiprocessor code are omitted here)
/*
* start system 32-bit setup. We need to re-do some of the things done
* in 16-bit mode for the "real" operations.
*/
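call setup_idt # installs the earliest handler for every interrupt; look further down, then come back

(some code is omitted here: it checks the CPU type and parameters, and includes some multiprocessor code)

jmp start_kernel # and we have reached the Modern Age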

/*
*  setup_idt
*
*  sets up a idt with 256 entries pointing to
*  ignore_int, interrupt gates. It doesn't actually load
*  idt - that can be done only after paging has been enabled
*  and the kernel moved to PAGE_OFFSET. Interrupts
*  are enabled elsewhere, when we can be relatively
*  sure everything is ok.
*
*  Warning: %esi is live across this function.
*/
setup_idt:
lea ignore_int,%edx
movl $(__KERNEL_CS << 16),%eax
movw %dx,%ax /* selector = 0x0010 = cs */
movw $0x8E00,%dx /* interrupt gate - dpl=0, present */

lea idt_table,%edi
mov $256,%ecx
rp_sidt:
movl %eax,(%edi)
movl %edx,4(%edi)
addl $8,%edi
dec %ecx
jne rp_sidt

The code below replaces the handlers for a few interrupts/exceptions

.macro set_early_handler handler,trapno
lea \handler,%edx
movl $(__KERNEL_CS << 16),%eax
movw %dx,%ax
movw $0x8E00,%dx /* interrupt gate - dpl=0, present */
lea idt_table,%edi
movl %eax,8*\trapno(%edi)
movl %edx,8*\trapno+4(%edi)
.endm

set_early_handler handler=early_divide_err,trapno=0
set_early_handler handler=early_illegal_opcode,trapno=6
set_early_handler handler=early_protection_fault,trapno=13
set_early_handler handler=early_page_fault,trapno=14

ret

(the implementations of these handlers are omitted here; they are simple and basically just halt the machine)

.section .text
/*
* Real beginning of normal "text" segment
*/
ENTRY(stext)
ENTRY(_stext)
# take note of the comment above; more on it later
/*
* BSS section
*/
.section ".bss.page_aligned","wa"
.align PAGE_SIZE_asm
ENTRY(swapper_pg_dir)
.fill 1024,4,0
(a little is omitted here)

/*
* This starts the data section.
*/
.data
ENTRY(stack_start)
.long init_thread_union+THREAD_SIZE
.long __BOOT_DS

ready: .byte 0

early_recursion_flag:
.long 0

int_msg:
.asciz "Unknown interrupt or fault at EIP %p %p %p\n"

fault_msg:
.ascii "Int %d: CR2 %p  err %p  EIP %p  CS %p  flags %p\n"
.asciz "Stack: %p %p %p %p %p %p %p %p\n"

#include "../xen/xen-head.S"

/*
* The IDT and GDT 'descriptors' are a strange 48-bit object
* only used by the lidt and lgdt instructions. They are not
* like usual segment descriptors - they consist of a 16-bit
* segment size, and 32-bit linear address value:
*/

.globl boot_gdt_descr
.globl idt_descr

ALIGN
# early boot GDT descriptor (must use 1:1 address mapping)
.word 0 # 32 bit align gdt_desc.address
boot_gdt_descr:
.word __BOOT_DS+7
.long boot_gdt - __PAGE_OFFSET

.word 0 # 32-bit align idt_desc.address
idt_descr:
.word IDT_ENTRIES*8-1 # idt contains 256 entries
.long idt_table

# boot GDT descriptor (later on used by CPU#0):
.word 0 # 32 bit align gdt_desc.address
ENTRY(early_gdt_descr)
.word GDT_ENTRIES*8-1
.long per_cpu__gdt_page /* Overwritten for secondary CPUs */

/*
* The boot_gdt must mirror the equivalent in setup.S and is
* used only for booting.
*/
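.align L1_CACHE_BYTES # aligns the table below to an L1 cache line boundary, presumably so that these frequently used bytes share a cache line; L1 is the smallest and most precious cache level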
ENTRY(boot_gdt)
.fill GDT_ENTRY_BOOT_CS,8,0
.quad 0x00cf9a000000ffff /* kernel 4GB code at 0x00000000 */
.quad 0x00cf92000000ffff /* kernel 4GB data at 0x00000000 */
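The two .quad values can be decoded to confirm what the comments claim. The following C sketch splits a segment descriptor into base, limit and a couple of flags according to the standard x86 descriptor layout; the function names are invented for this example.

```c
#include <stdint.h>

/* Base address: descriptor bits 16..39 and 56..63. */
uint32_t gdt_base(uint64_t d)
{
    return (uint32_t)(((d >> 16) & 0xFFFFFFu) | (((d >> 56) & 0xFFu) << 24));
}

/* 20-bit limit: descriptor bits 0..15 and 48..51. */
uint32_t gdt_limit(uint64_t d)
{
    return (uint32_t)((d & 0xFFFFu) | (((d >> 48) & 0xFu) << 16));
}

/* Segment size in bytes; the G flag (bit 55) scales the limit by 4 KiB. */
uint64_t gdt_size_bytes(uint64_t d)
{
    uint64_t units = (uint64_t)gdt_limit(d) + 1u;

    return ((d >> 55) & 1u) ? units * 4096u : units;
}

/* Bit 43 (bit 3 of the type field) is set for code segments. */
int gdt_is_code(uint64_t d)
{
    return (int)((d >> 43) & 1u);
}
```

For 0x00cf9a000000ffff this gives base 0, limit 0xFFFFF and a size of 4 GB, flagged as code; 0x00cf92000000ffff differs only in the code bit, so it is the matching flat data segment.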

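5. The Modern Age: start_kernel
This is the first time we leave the arch/i386 directory: start_kernel is in linux/init/main.c, and apart from a small amount, the code discussed below is all in linux/init/.
After start_kernel there is much less architecture-specific code. There is so much in this function, and I have read so little of it, that strictly speaking I cannot go on from here. Still, I will try to cover what I do understand, up to the point where the init process starts, skipping the vast majority of the kernel initialization done in start_kernel.
As an aside, the code of this function is placed in .init.text. Like a module's init function it runs only once, and in the final stage the kernel reclaims the memory occupied by the init sections: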

void free_initmem(void)
{
      free_init_pages("unused kernel memory",
                     (unsigned long)(&__init_begin),
                     (unsigned long)(&__init_end));
}

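Back to start_kernel: after the kernel initialization, its last line is
   /* Do the rest non-__init'ed, we're now alive */
      rest_init();
which first runs kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);
to create a kernel thread, and finally calls cpu_idle(), which basically saves the machine's energy and yields the CPU to whoever wants it, so I will say no more about it. Following the flow, next comes kernel_init. For kernel threads, see Chapter 3 of ULK3.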

static int __init kernel_init(void * unused)
{
      /* a series of initializations */
      do_basic_setup();

      /*
       * check if there is an early userspace init.  If yes, let it do all
       * the work
       */
      if (!ramdisk_execute_command)
            ramdisk_execute_command = "/init";

      if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
            ramdisk_execute_command = NULL;
            prepare_namespace();
      }

      /*
       * Ok, we have completed the initial bootup, and
       * we're essentially up and running. Get rid of the
       * initmem segments and start the user-mode stuff..
       */
      init_post();
      return 0;
}
/* end of kernel_init */

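So, after a series of initializations:
1. do_basic_setup is called
2. then the init process is executed
3. finally the kernel settles into waiting to service requests.

Next, the do_basic_setup function: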

/*
 * Ok, the machine is now initialized. None of the devices
 * have been touched yet, but the CPU subsystem is up and
 * running, and memory and process management works.
 *
 * Now we can finally start doing some real work..
 */
static void __init do_basic_setup(void)
{
      /* drivers will send hotplug events */
      init_workqueues();
      usermodehelper_init();
      driver_init();
      init_irq_proc();
      do_initcalls();
}

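I will not explain the first few functions it calls; let us look at the last one, do_initcalls().

First, the kernel has dedicated sections that hold functions to be called at the end of initialization.
For example, the last statement of init/initramfs.c is
rootfs_initcall(populate_rootfs);
where
#define rootfs_initcall(fn)          __define_initcall("rootfs",fn,rootfs)
#define __define_initcall(level,fn,id) \
      static initcall_t __initcall_##fn##id __attribute_used__ \
      __attribute__((__section__(".initcall" level ".init"))) = fn
So a pointer to populate_rootfs is placed in .initcallrootfs.init.
At link time this section is emitted into the .initcall.init output section (to avoid jumping back and forth too much, more on that later).

Now that the initcall mechanism is clear, back to do_initcalls();
in essence it is
      for (call = __initcall_start; call < __initcall_end; call++)
            result = (*call)();

This way every initcall gets called. It is a way to avoid gathering all the initialization code in one place, although it raises a question: who runs first? The kernel divides this big section into several classes, and the linker arranges which come earlier and which later.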
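The idea can be sketched in portable C. In the kernel the table of function pointers is assembled by the linker from the .initcall*.init sections; the sketch below replaces that with an ordinary array, so it shows only the calling loop and the ordering, not the section mechanism. All names starting with demo_ are hypothetical.

```c
#include <stddef.h>

typedef int (*initcall_t)(void);

static int call_order[8];   /* records which demo initcall ran when */
static size_t calls_made;

/* Hypothetical init functions standing in for real initcalls. */
static int demo_core_init(void)   { call_order[calls_made++] = 1; return 0; }
static int demo_rootfs_init(void) { call_order[calls_made++] = 2; return 0; }
static int demo_late_init(void)   { call_order[calls_made++] = 3; return 0; }

/* An ordinary array plays the role of the .initcall.init section. */
static initcall_t initcall_table[] = {
    demo_core_init, demo_rootfs_init, demo_late_init,
};

/* Call every entry in table order; returns the number of failures. */
int run_initcalls(void)
{
    int failures = 0;
    size_t n = sizeof initcall_table / sizeof initcall_table[0];

    for (size_t i = 0; i < n; i++)
        if (initcall_table[i]() != 0)
            failures++;
    return failures;
}

/* Which demo initcall ran at position pos (0-based), or 0 if none. */
int initcall_ran_at(size_t pos)
{
    return pos < calls_made ? call_order[pos] : 0;
}
```

The order of execution is simply the order of the entries in the table, which is why the order in which the linker script lists the initcall classes decides who runs first.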

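Conveniently, we now know that populate_rootfs gets executed. It handles initramfs and initrd: it first runs unpack_to_rootfs and then checks whether there is an initrd. I will not list the code.

Back in kernel_init, which called do_basic_setup().
Next comes
      if (!ramdisk_execute_command)
            ramdisk_execute_command = "/init";

      if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
            ramdisk_execute_command = NULL;
            prepare_namespace();
      }

This code checks whether the root filesystem needs to be mounted. If vmlinuz carries an initramfs that already contains an init, it does not (the target system I currently work on is like this, with an init inside). Otherwise the kernel still has to mount the root filesystem where init lives (which is also the initial root filesystem of all user-space processes). Mounting the root filesystem and executing init are the last things the Linux boot process does.
I will skip the details of mount_root; prepare_namespace is well worth reading, so go and look at the source. If it fails, you get a panic along the lines of "VFS no root found". Now assume the root filesystem is in place; we reach the last function call in kernel_init: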

/* This is a non __init function. Force it to be noinline otherwise gcc
 * makes it inline to init() and it becomes part of init.text section
 */
static int noinline init_post(void)
{
      free_initmem();     /* the function mentioned above that frees the init code, which is why init_post itself clearly cannot be __init */
      unlock_kernel();
      mark_rodata_ro();
      system_state = SYSTEM_RUNNING;
      numa_default_policy();

      if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
            printk(KERN_WARNING "Warning: unable to open an initial console.\n");

      (void) sys_dup(0);
      (void) sys_dup(0);

      if (ramdisk_execute_command) {
            run_init_process(ramdisk_execute_command);
            printk(KERN_WARNING "Failed to execute %s\n",
                              ramdisk_execute_command);
      }

      /*
       * We try each of these until one succeeds.
       *
       * The Bourne shell can be used instead of init if we are
       * trying to recover a really broken machine.
       */
      if (execute_command) {
            run_init_process(execute_command);
            printk(KERN_WARNING "Failed to execute %s.  Attempting "
                                    "defaults...\n", execute_command);
      }
      run_init_process("/sbin/init");
      run_init_process("/etc/init");
      run_init_process("/bin/init");
      run_init_process("/bin/sh");

      panic("No init found.  Try passing init= option to kernel.");
}


This function basically executes init, and panics if that fails.
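The "try each until one succeeds" pattern at the end of init_post can be modelled in plain C. In the kernel a successful run_init_process() never returns, so simply falling through to the next line means failure; the sketch below expresses the same order of attempts with a callback that reports success instead. demo_try_exec and pick_init are hypothetical names used only here.

```c
#include <stddef.h>
#include <string.h>

typedef int (*try_exec_fn)(const char *path);

/* Hypothetical stand-in for an exec attempt: only /bin/sh "works". */
int demo_try_exec(const char *path)
{
    return strcmp(path, "/bin/sh") == 0;
}

/*
 * Try each candidate in order and return the first that succeeds,
 * or NULL when none does (the point where the kernel would panic).
 */
const char *pick_init(const char *const *candidates, size_t n,
                      try_exec_fn try_exec)
{
    for (size_t i = 0; i < n; i++)
        if (candidates[i] != NULL && try_exec(candidates[i]))
            return candidates[i];
    return NULL;
}
```

Given the list /sbin/init, /etc/init, /bin/init, /bin/sh and this stand-in callback, pick_init returns "/bin/sh", matching the last-resort shell in the kernel's list.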
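Incidentally, /dev/console ends up referenced by descriptors 0, 1 and 2, so it is the standard input, output and error of every process that does not reopen them.

This completes the kernel boot process. init then initializes the user-space processes according to the configuration in the root filesystem and brings the system up.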

6. Linking

Linking is controlled by arch/i386/kernel/vmlinux.lds.S and arch/i386/boot/setup.ld

setup.ld

/*
 * setup.ld
 *
 * Linker script for the i386 setup code
 */
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)

SECTIONS
{
      . = 0;
      .bstext       : { *(.bstext) }      /* the code in header.S */
      .bsdata       : { *(.bsdata) }

      . = 497;
      .header       : { *(.header) }      /* hdr starts at 497; its first 15 bytes fill the boot sector to exactly 512 bytes */
      .inittext    : { *(.inittext) }     /* the code from start_of_setup on in header.S */
      .initdata    : { *(.initdata) }
      .text          : { *(.text*) }      /* the C code in boot/ */

      . = ALIGN(16);
      .rodata       : { *(.rodata*) }

      .videocards    : {
            video_cards = .;
            *(.videocards)
            video_cards_end = .;
      }

      . = ALIGN(16);
      .data          : { *(.data*) }

      .signature    : {
            setup_sig = .;                /* remember the cmpl instruction? */
            LONG(0x5a5aaa55)
      }

      . = ALIGN(16);
      .bss          :
      {
            __bss_start = .;
            *(.bss)
            __bss_end = .;
      }
      /* the bss-clearing code in setup uses __bss_start together with _end, defined just below */
      . = ALIGN(16);
      _end = .;

      /DISCARD/ : { *(.note*) }

      . = ASSERT(_end <= 0x8000, "Setup too big!");     /* setup must not exceed 32K */
      . = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");
}
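The ". = 497" line and the "hdr == 0x1f1" assertion can be cross-checked against the field list in header.S. The C sketch below declares the fields from setup_sects to boot_flag as a packed struct and computes where they end; the struct and function names are made up for this example, and the packed attribute is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdint.h>

#define HDR_OFFSET 0x1F1u   /* ". = 497" in setup.ld */

/* The fields from hdr up to boot_flag, in the order used by header.S. */
struct boot_sector_hdr {
    uint8_t  setup_sects;
    uint16_t root_flags;
    uint32_t syssize;
    uint16_t ram_size;
    uint16_t vid_mode;
    uint16_t root_dev;
    uint16_t boot_flag;
} __attribute__((packed));   /* GCC/Clang extension: no padding */

/* File offset of boot_flag (the 0xAA55 signature). */
size_t boot_flag_offset(void)
{
    return HDR_OFFSET + offsetof(struct boot_sector_hdr, boot_flag);
}

/* File offset of the first byte after boot_flag. */
size_t boot_sector_end(void)
{
    return HDR_OFFSET + sizeof(struct boot_sector_hdr);
}
```

boot_flag lands at offset 0x1FE, the usual place for the 0xAA55 boot signature, and the fields end at offset 512, which is where the listing places _start ("# offset 512, entry point").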

I will not reproduce all of vmlinux.lds.S. As an example, here is how the initcall section mentioned above is arranged:

.initcall.init : AT(ADDR(.initcall.init) - LOAD_OFFSET) {
      __initcall_start = .;
      INITCALLS
      __initcall_end = .;
  }

#define INITCALLS                                                    \
      (a lot here)                                                   \
      *(.initcallrootfs.init)                                        \
      (a lot here)

The kernel has many other special sections, for example
#define __EXPORT_SYMBOL(sym, sec)                            \
      extern typeof(sym) sym;                               \
      __CRC_SYMBOL(sym, sec)                               \
      static const char __kstrtab_##sym[]                   \
      __attribute__((section("__ksymtab_strings")))          \
      = MODULE_SYMBOL_PREFIX #sym;                         \
      static const struct kernel_symbol __ksymtab_##sym    \
      __attribute_used__                                     \
      __attribute__((section("__ksymtab" sec), unused))    \
      = { (unsigned long)&sym, __kstrtab_##sym }

#define EXPORT_SYMBOL(sym)                                     \
      __EXPORT_SYMBOL(sym, "")

#define EXPORT_SYMBOL_GPL(sym)                               \
      __EXPORT_SYMBOL(sym, "_gpl")

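These are used to export symbols. Unlike an ordinary executable produced by ld, the kernel has to resolve and relocate symbols for other files (modules) at run time, so it needs this information. In the example above, _GPL symbols go into a separate section, so when the symbols of a module are resolved and the module is not released under the GPL, the __ksymtab_gpl section is not searched.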
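The lookup rule can be sketched with two small tables. This is a toy model under stated assumptions: the demo_ tables stand in for the __ksymtab and __ksymtab_gpl sections, the symbol names and values are invented, and the real module loader does considerably more than this.

```c
#include <stddef.h>
#include <string.h>

struct demo_symbol {
    const char *name;
    unsigned long value;
};

/* Hypothetical tables standing in for __ksymtab and __ksymtab_gpl. */
static const struct demo_symbol demo_ksymtab[] = {
    { "demo_printk", 0x1000 },
};
static const struct demo_symbol demo_ksymtab_gpl[] = {
    { "demo_gpl_only", 0x2000 },
};

static const struct demo_symbol *search(const struct demo_symbol *tab,
                                        size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return &tab[i];
    return NULL;
}

/*
 * Resolve a symbol for a module. The GPL-only table is searched only
 * when the module is GPL-licensed. Returns 0 if nothing is found.
 */
unsigned long resolve_symbol(const char *name, int module_is_gpl)
{
    const struct demo_symbol *s =
        search(demo_ksymtab,
               sizeof demo_ksymtab / sizeof demo_ksymtab[0], name);

    if (s == NULL && module_is_gpl)
        s = search(demo_ksymtab_gpl,
                   sizeof demo_ksymtab_gpl / sizeof demo_ksymtab_gpl[0],
                   name);
    return s ? s->value : 0;
}
```

A non-GPL module asking for demo_gpl_only gets 0 (unresolved), while a GPL module gets 0x2000; keeping the two groups in separate sections is what makes this check a simple choice of which tables to search.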

7. References
These are the sources I consulted directly:
linux/Documentation/i386/boot.txt
linux/Documentation/initrd.txt
linux/Documentation/kbuild/makefile.txt
ld.pdf gcc.pdf
ULK3 chapters 2, 3, 4, 9, 12 and appendices A, B
the Linux source code, and a few dozen pages of the Intel developer's manual, volume 3A
