http://bbs.chinaunix.net/thread-2087843-1-1.html
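The Linux boot process strikes me as quite complicated. I read a lot of material and still did not fully understand it, so I went through some of the code under linux/arch/i386 alongside those references. Here are my notes, in the hope that they help. Most of this is my own understanding, so parts of it may be wrong; corrections are welcome.

I. The kernel image

Start with /boot/vmlinuz-2.6.23. This is the kernel file: at boot time the bootloader loads it into memory and then executes it. Everyone knows that much, but the actual process is fairly involved. Linux works with many bootloaders and runs on many architectures; since the code I read is for i386, I will only talk about GRUB here. There are several hard parts: how can GRUB read a file, where does it load it, and how does it start the kernel?

As for the first question: the first 512 bytes of GRUB (its boot sector) contain data that says which disk blocks hold the second-stage code (512 bytes are clearly not enough to understand a file system). That data is recorded when GRUB is installed. Once the second stage runs, GRUB can read file systems.

So assume GRUB can already read the file system. It now has vmlinuz (as I will call it from here on). What it does with it is governed by a protocol (for details see linux-2.6.23/Documentation/i386/boot.txt).

First, what vmlinuz consists of:

  1. the first 512 bytes
  2. a second piece of code, a modest number of 512-byte sectors (its size is discussed below)
  3. the protected-mode kernel code

The first 512 bytes are the traditional boot sector, corresponding to the Ancient Age in ULK3 (although it is a bit special, because it is no longer used as a boot sector, as we will see). It used to live in arch/i386/boot/bootsect.S, but the code shows that it has now been merged, together with part of the second piece, into arch/i386/boot/header.S.

The second piece is the real-mode setup code, the Middle Ages in ULK3. As just said, in current versions part of setup is merged with the first piece into header.S, and the rest of the setup code comes from other source files in the boot directory (see boot/Makefile). The size of this code is known when setup is linked, and it is then written into one byte at offset 0x01F1 of the first 512 bytes (which is also offset 0x01F1 of the whole vmlinuz). According to boot.txt this byte (setup_sects) is the size of the setup code in 512-byte sectors, not counting the boot sector. For example:

  xuchm@debian:~/doc$ od -j 0x01F1 -N1 /boot/vmlinuz-2.6.23 -D
  0000761 21
  0000762

The value 21 means the setup code is 21*512 bytes (10.5 KB), so the first and second pieces together occupy (1+21)*512 bytes.

The third piece is made up of all the rest of the kernel code. Because it runs after the switch to protected mode, it is called the protected-mode code, in contrast to setup.

Note that I am describing an uncompressed kernel here, so the decompression step covered in ULK3 is left out.

With the layout of the kernel file in mind, let's look at the loading process. First, a passage from Documentation/i386/boot.txt:

  For a modern bzImage kernel with boot protocol version >= 2.02, a
  memory layout like the following is suggested:

          ~                        ~
          |  Protected-mode kernel |
          # this is the protected-mode code mentioned above; GRUB puts it here
  100000  +------------------------+
          |  I/O memory hole       |
  0A0000  +------------------------+
          |  Reserved for BIOS     |  Leave as much as possible unused
          ~                        ~
          |  Command line          |  (Can also be below the X+10000 mark)
  X+10000 +------------------------+
          |  Stack/heap            |  For use by the kernel real-mode code.
  X+08000 +------------------------+
          |  Kernel setup          |  The kernel real-mode code.
          |  Kernel boot sector    |  The kernel legacy boot sector.
          # these two are the code and data of the first two pieces of vmlinuz
  X       +------------------------+
          |  Boot loader           |  <- Boot sector entry point 0000:7C00
  001000  +------------------------+
          |  Reserved for MBR/BIOS |
  000800  +------------------------+
          |  Typically used by MBR |
  000600  +------------------------+
          |  BIOS use only         |
  000000  +------------------------+

  ... where the address X is as low as the design of the boot loader
  permits.

As you can see, the first two pieces of vmlinuz stay together, and their location is not fixed; it depends on the size of GRUB itself. (I call these two pieces the real-mode code, even though the passage above uses that term only for the setup part.) The third piece goes to 0x100000; whether it really goes there is indicated by the byte at offset 0x0211, which comes up again below.

II. Assembly code: header.S

As described above, the first piece and part of the second piece are in boot/header.S, so let's read them together. For completeness I paste all of it, which takes a lot of space. My annotations are the lines that are not part of the kernel source (originally written in Chinese).

Starting again from loading: the protected-mode code of vmlinuz is loaded at 0x100000. The real-mode code (once more: both bootsect and setup) is not required to be loaded at a fixed location, namely the

  Kernel setup          The kernel real-mode code.
  Kernel boot sector

entries above; its position is affected by the size of GRUB. So let's assume it is loaded at the logical address __LOAD_DS__:0000. Now the code:

/*
 *  header.S
 *
 *  Copyright (C) 1991, 1992 Linus Torvalds
 *
 *  Based on bootsect.S and setup.S
 *  modified by more people than can be counted
 *
 *  Rewritten as a common file by H. Peter Anvin (Apr 2007)
 *
 * BIG FAT NOTE: We're in real mode using 64k segments.  Therefore segment
 * addresses must be multiplied by 16 to obtain their respective linear
 * addresses. To avoid confusion, linear addresses are written using leading
 * hex while segment addresses are written as segment:offset.
 *
 */
#include
#include
#include
#include
#include
#include
#include "boot.h"

SETUPSECTS  = 4             /* default nr of setup-sectors */
BOOTSEG     = 0x07C0        /* original address of boot-sector */
SYSSEG      = DEF_SYSSEG    /* system loaded at 0x10000 (65536) */
SYSSIZE     = DEF_SYSSIZE   /* system size: # of 16-byte clicks */
                            /* to be loaded */
ROOT_DEV    = 0             /* ROOT_DEV is now written by "build" */
SWAP_DEV    = 0             /* SWAP_DEV is now written by "build" */

#ifndef SVGA_MODE
#define SVGA_MODE ASK_VGA
#endif

#ifndef RAMDISK
#define RAMDISK 0
#endif

#ifndef ROOT_RDONLY
#define ROOT_RDONLY 2
#endif

    .code16
    .section ".bstext", "ax"        # note the name of this section

    .global bootsect_start
bootsect_start:
# the bootsect code starts here, i.e. the source of the first 512 bytes of vmlinuz

    # Normalize the start address
    ljmp    $BOOTSEG, $start2

# 0x07C0:0000. If execution starts here, the sector was loaded directly by the
# BIOS. That is not allowed, because Linux now requires a bootloader. This is
# what is special about this bootsect: it is not meant to be executed. If the
# BIOS does run it as a boot sector, it just asks the user to reboot.
# You can try this in VMware:
#     dd if=/boot/vmlinuz-2.6.23 of=vm.img bs=512 count=1
# and boot with vm.img as the floppy image; the message below appears.
# All this code does is print the message and wait for a reboot, so I will
# not say more about it.

start2:
    movw    %cs, %ax
    movw    %ax, %ds
    movw    %ax, %es
    movw    %ax, %ss
    xorw    %sp, %sp
    sti
    cld

    movw    $bugger_off_msg, %si

msg_loop:
    lodsb
    andb    %al, %al
    jz      bs_die
    movb    $0xe, %ah
    movw    $7, %bx
    int     $0x10
    jmp     msg_loop

bs_die:
    # Allow the user to press a key, then reboot
    xorw    %ax, %ax
    int     $0x16
    int     $0x19

    # int 0x19 should never return.  In case it does anyway,
    # invoke the BIOS reset code...
    ljmp    $0xf000,$0xfff0

    .section ".bsdata", "a"         # note the section
bugger_off_msg:
    .ascii  "Direct booting from floppy is no longer supported.\r\n"
    .ascii  "Please use a boot loader program instead.\r\n"
    .ascii  "\n"
    .ascii  "Remove disk and press any key to reboot . . .\r\n"
    .byte   0

    # Kernel attributes; used by setup.  This is part 1 of the
    # header, from the old boot sector.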
# this is also part of the first 512 bytes of vmlinuz, but it is data

    .section ".header", "a"         # note the section name
    .globl  hdr
hdr:
setup_sects:    .byte SETUPSECTS
root_flags:     .word ROOT_RDONLY
syssize:        .long SYSSIZE
ram_size:       .word RAMDISK
vid_mode:       .word SVGA_MODE
root_dev:       .word ROOT_DEV
boot_flag:      .word 0xAA55        # looks familiar, doesn't it

The three sections defined so far, .bstext, .bsdata and .header, together make up the first 512 bytes of vmlinuz described above. What follows is the Middle Ages code, and it is where the *first instruction* runs when the kernel normally takes over from the bootloader.

As said above, GRUB loads the real-mode code at __LOAD_DS__:0000. How does GRUB execute it?

    jmp_far(__LOAD_DS__+0x20, 0);   /* Run the kernel */

Adding 0x20 to the segment is the same as adding 0x200 bytes (in real mode a logical address maps to the linear address (seg<<4)+offset), so this jumps into the real-mode setup code of vmlinuz and skips the 512-byte bootsect. In GRUB this is a single ljmp instruction.

So the flow of execution has reached the setup code, at the second 512-byte sector of vmlinuz, which is _start:

    # offset 512, entry point

    .globl  _start
_start:
        # Explicitly enter this as bytes, or the assembler
        # tries to generate a 3-byte jump here, which causes
        # everything else to push off to the wrong offset.
        .byte   0xeb                # short (2-byte) jump
        .byte   start_of_setup-1f
# This is the first instruction: a short jump over some data. The data comes
# up again later, so for now execution continues at start_of_setup.
1:

    # Part 2 of the header, from the old setup.S

        .ascii  "HdrS"              # header signature
        .word   0x0206              # header version number (>= 0x0105)
                                    # or else old loadlin-1.5 will fail)
        .globl realmode_swtch
realmode_swtch: .word   0, 0        # default_switch, SETUPSEG
start_sys_seg:  .word   SYSSEG
        .word   kernel_version-512  # pointing to kernel version string
                                    # above section of header is compatible
                                    # with loadlin-1.5 (header v1.5). Don't
                                    # change it.

type_of_loader: .byte   0           # = 0, old one (LILO, Loadlin,
                                    #      Bootlin, SYSLX, bootsect...)
                                    # See Documentation/i386/boot.txt for
                                    # assigned ids

# flags, unused bits must be zero (RFU) bit within loadflags
loadflags:
LOADED_HIGH     = 1                 # If set, the kernel is loaded high
CAN_USE_HEAP    = 0x80              # If set, the loader also has set
                                    # heap_end_ptr to tell how much
                                    # space behind setup.S can be used for
                                    # heap purposes.
                                    # Only the loader knows what is free
#ifndef __BIG_KERNEL__
        .byte   0
#else
        .byte   LOADED_HIGH
#endif

setup_move_size: .word  0x8000      # size to move, when setup is not
                                    # loaded at 0x90000. We will move setup
                                    # to 0x90000 then just before jumping
                                    # into the kernel. However, only the
                                    # loader knows how much data behind
                                    # us also needs to be loaded.

code32_start:                       # here loaders can put a different
                                    # start address for 32-bit code.
#ifndef __BIG_KERNEL__
        .long   0x1000              #   0x1000 = default for zImage
#else
        .long   0x100000            # 0x100000 = default for big kernel
# tells the loader to place the protected-mode code at 0x100000 (from 1M up)
#endif

ramdisk_image:  .long   0           # address of loaded ramdisk image
                                    # Here the loader puts the 32-bit
                                    # address where it loaded the image.
                                    # This only will be read by the kernel.

ramdisk_size:   .long   0           # its size in bytes

bootsect_kludge:
        .long   0                   # obsolete

heap_end_ptr:   .word   _end+1024   # (Header version 0x0201 or later)
                                    # space from here (exclusive) down to
                                    # end of setup code can be used by setup
                                    # for local heap purposes.

pad1:           .word   0
cmd_line_ptr:   .long   0           # (Header version 0x0202 or later)
                                    # If nonzero, a 32-bit pointer
                                    # to the kernel command line.
                                    # The command line should be
                                    # located between the start of
                                    # setup and the end of low
                                    # memory (0xa0000), or it may
                                    # get overwritten before it
                                    # gets read.  If this field is
                                    # used, there is no longer
                                    # anything magical about the
                                    # 0x90000 segment; the setup
                                    # can be located anywhere in
                                    # low memory 0x10000 or higher.

ramdisk_max:    .long (-__PAGE_OFFSET-(512 << 20)-1) & 0x7fffffff
                                    # (Header version 0x0203 or later)
                                    # The highest safe address for
                                    # the contents of an initrd

kernel_alignment:  .long CONFIG_PHYSICAL_ALIGN  #physical addr alignment
                                                #required for protected mode
                                                #kernel
#ifdef CONFIG_RELOCATABLE
relocatable_kernel:    .byte 1
#else
relocatable_kernel:    .byte 0
#endif
pad2:           .byte 0
pad3:           .word 0

cmdline_size:   .long   COMMAND_LINE_SIZE-1     #length of the command line,
                                                #added with boot protocol
                                                #version 2.06

# End of setup header #####################################################

    .section ".inittext", "ax"
# execution arrives here ************
start_of_setup:
#ifdef SAFE_RESET_DISK_CONTROLLER
# Reset the disk controller.
    movw    $0x0000, %ax            # Reset disk controller
    movb    $0x80, %dl              # All disks
    int     $0x13
#endif

# We will have entered with %cs = %ds+0x20, normalize %cs so
# it is on par with the other segments.

As described above, GRUB arrives here with an ljmp, and at that point the data segment has been set to __LOAD_DS__, so after the jump the cs register holds __LOAD_DS__ + 0x20. The code below resets cs to __LOAD_DS__. (Without this the instruction offsets would be wrong: at link time the setup code is placed after the first 512 bytes, so cs:ip would refer to addresses 0x200 too high.)

    pushw   %ds
    pushw   $setup2
    lretw

setup2:
# Force %es = %ds

    movw    %ds, %ax
    movw    %ax, %es
    cld

# Stack paranoia: align the stack and make sure it is good
# for both 16- and 32-bit references.  In particular, if we
# were meant to have been using the full 16-bit segment, the
# caller might have set %sp to zero, which breaks %esp-based
# references.
    andw    $~3, %sp                # dword align (might as well...)
    jnz     1f
    movw    $0xfffc, %sp            # Make sure we're not zero
1:  movzwl  %sp, %esp               # Clear upper half of %esp
    sti

# Check signature at end of setup
    cmpl    $0x5a5aaa55, setup_sig
    jne     setup_bad

The instructions above set up the stack. For an intact setup image the cmpl always matches.

# Zero the bss
    movw    $__bss_start, %di
    movw    $_end+3, %cx
    xorl    %eax, %eax
    subw    %di, %cx
    shrw    $2, %cx
    rep; stosl

The code above clears the bss section of setup. Later we will see symbols with similar names in the protected-mode kernel code. They are not the same symbols, but they serve the same purpose, which is why the names match.

# Jump to C code (should not return)
    calll   main
# off to C code now; it is in boot/main.c

# Setup corrupt somehow...
setup_bad:
    movl    $setup_corrupt, %eax
    calll   puts
    # Fall through...

    .globl  die
    .type   die, @function
die:
    hlt
    jmp     die

    .size   die, .-die

    .section ".initdata", "a"
setup_corrupt:
    .byte   7
    .string "No setup signature found...\n"

III. C code: boot/main.c

header.S ends by calling main, which is void main() in boot/main.c. I have not read most of this function; if you want to, the section on the setup function in appendix A of ULK3 is a useful reference. Look at the last line of main:

    /* Do the last things and invoke protected mode */
    go_to_protected_mode();

This function in turn calls

    protected_mode_jump(boot_params.hdr.code32_start,
                        (u32)&boot_params + (ds() << 4));

It takes two arguments. The first is the address of the first protected-mode instruction, which as said above is 0x100000. The second is the parameter block handed to the kernel; because we are switching to protected mode, it must be given as a linear address rather than as an offset within the segment. ds() simply returns the value of the ds register.

Here is the function; the code is in pmjump.S.

/*
 * void protected_mode_jump(u32 entrypoint, u32 bootparams);
 */

First of all, this function receives its arguments in registers; see the CFLAGS in boot/Makefile.

protected_mode_jump:
    xorl    %ebx, %ebx              # Flag to indicate this is a boot
    movl    %edx, %esi              # Pointer to boot_params table
# boot_params is kept in %esi
    movl    %eax, 2f                # Patch ljmpl instruction
# see the code at label 2 below
    jmp     1f                      # Short jump to flush instruction q.

1:
    movw    $__BOOT_DS, %cx

    movl    %cr0, %edx
    orb     $1, %dl                 # Protected mode (PE) bit
    movl    %edx, %cr0

    movw    %cx, %ds
    movw    %cx, %es
    movw    %cx, %fs
    movw    %cx, %gs
    movw    %cx, %ss

    # Jump to the 32-bit entrypoint
    .byte   0x66, 0xea              # ljmpl opcode
2:  .long   0                       # offset
# this has already been patched to 0x100000; since we are now in protected
# mode, __BOOT_CS is a selector into the GDT
    .word   __BOOT_CS               # segment

    .size   protected_mode_jump, .-protected_mode_jump

By the end of main, then, interrupts have been disabled, an initial GDT and IDT have been set up, and the CPU has entered protected mode. Execution has moved to linear address 0x100000, the protected-mode code mentioned at the start of this article. That code is in arch/i386/kernel/head.S, so we are now in the startup_32 function of ULK3, the Renaissance.

(The segment base is 0, see setup_gdt() called from go_to_protected_mode(), so a linear address equals the offset. Paging is not on yet, so the linear address is also the physical address, and physical address 1M is exactly where the protected-mode code sits.)

I do not want to say too much about protected mode, so I am not listing that code, and I am leaving out a lot of related background. Reading the documentation is easy and explaining it is hard; any document will be clearer than I am.

Right, now we are in protected mode.

IV. arch/i386/kernel/head.S

As said above, the first protected-mode code to run is head.S. It is fairly long, over 500 lines, so I will not list all of it the way I did above. I drop code that does not affect the overall picture of head.S and comments that do not help much. Whatever I do list is in the same order as in the source, for easy lookup, and omissions are marked.

.text
(many #include lines omitted)

/*
 * References to members of the new_cpu_data structure.
 */

#define X86             new_cpu_data+CPUINFO_x86
#define X86_VENDOR      new_cpu_data+CPUINFO_x86_vendor
#define X86_MODEL       new_cpu_data+CPUINFO_x86_model
#define X86_MASK        new_cpu_data+CPUINFO_x86_mask
#define X86_HARD_MATH   new_cpu_data+CPUINFO_hard_math
#define X86_CPUID       new_cpu_data+CPUINFO_cpuid_level
#define X86_CAPABILITY  new_cpu_data+CPUINFO_x86_capability
#define X86_VENDOR_ID   new_cpu_data+CPUINFO_x86_vendor_id

(some comments describing the following macros omitted)

LOW_PAGES = 1<<(32-PAGE_SHIFT_asm)

#if PTRS_PER_PMD > 1
PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PMD) + PTRS_PER_PGD
#else
PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PGD)
#endif
BOOTBITMAP_SIZE = LOW_PAGES / 8
ALLOCATOR_SLOP = 4

INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE + (PAGE_TABLE_SIZE + ALLOCATOR_SLOP)*PAGE_SIZE_asm
/*
* 32-bit kernel entrypoint; only used by the boot CPU. On entry,
* %esi points to the real-mode code as a 32-bit pointer.
* CS and DS must be 4 GB flat segments, but we don't depend on
* any particular GDT layout, because we load our own as soon as we
* can.
*/
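# the GDT installed by setup_gdt in the setup code exists only for the ljmp right after entering protected mode
.section .text.head,"ax",@progbits   # the code goes into the .text.head section
ENTRY(startup_32)   # the first protected-mode instruction starts here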
/*
* Set segments to known values.
*/
cld
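lgdt boot_gdt_descr - __PAGE_OFFSET   # __PAGE_OFFSET is 3G, 0xC0000000, as everyone knows
"- __PAGE_OFFSET" appears in several places below. To reference a variable we need its physical address, and right now the linear address is the physical address because paging is off. The link-time address of every symbol, however, is its offset within the protected-mode image plus 0xC0000000 plus 0x100000 (the kernel will eventually run with paging, so it is linked relative to that base; more on this later). Without "- __PAGE_OFFSET", the value of boot_gdt_descr above would be 0xC0100000+n, where n is a small number: the offset of boot_gdt_descr from the start of the protected-mode code in vmlinuz. After subtracting, boot_gdt_descr - __PAGE_OFFSET is 0x100000+n, which is exactly the physical address of boot_gdt_descr.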
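The arithmetic in the note above can be written out as a small C sketch. The constants are the values described in the text (3 GiB split, image loaded at 1 MiB); the function names link_address and early_phys are made up for illustration and do not exist in the kernel.

```c
#include <stdint.h>

/* Values taken from the description above, not read from a real config. */
#define PAGE_OFFSET   0xC0000000u   /* __PAGE_OFFSET, the 3 GiB split     */
#define LOAD_PHYSICAL 0x00100000u   /* protected-mode code loaded at 1 MiB */

/* Link-time (virtual) address of a symbol n bytes into the image. */
static uint32_t link_address(uint32_t n)
{
    return PAGE_OFFSET + LOAD_PHYSICAL + n;
}

/* What "symbol - __PAGE_OFFSET" yields before paging is enabled:
 * the physical address at which the symbol really sits. */
static uint32_t early_phys(uint32_t virt)
{
    return virt - PAGE_OFFSET;
}
```

For a symbol 0x40 bytes into the image, link_address gives 0xC0100040 and early_phys turns that back into 0x00100040.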
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
/*
* Clear BSS first so that there are no surprises...
* No need to cld as DF is already clear from cld above...
*/
xorl %eax,%eax
movl $__bss_start - __PAGE_OFFSET,%edi   # do not confuse this with the symbol of the same name in setup
movl $__bss_stop - __PAGE_OFFSET,%ecx
subl %edi,%ecx
shrl $2,%ecx
rep ; stosl
/*
* Copy bootup parameters out of the way.
* Note: %esi still has the pointer to the real-mode data.
* With the kexec as boot loader, parameter segment might be loaded beyond
* kernel image and might not even be addressable by early boot page tables.
* (kexec on panic case). Hence copy out the parameters before initializing
* page tables.
*/
# as we saw, protected_mode_jump left the address of setup's boot_params in %esi
# make a copy; once again, this boot_params variable is not the same one as in setup
movl $(boot_params - __PAGE_OFFSET),%edi
movl $(PARAM_SIZE/4),%ecx
cld
rep
movsl
movl boot_params - __PAGE_OFFSET + NEW_CL_POINTER,%esi
andl %esi,%esi
jnz 2f # New command line protocol
cmpw $(OLD_CL_MAGIC),OLD_CL_MAGIC_ADDR
jne 1f
movzwl OLD_CL_OFFSET,%esi
addl $(OLD_CL_BASE_ADDR),%esi
2:
# handles the command-line parameters, e.g. init=/bin/bash, console=ttyS0
movl $(boot_command_line - __PAGE_OFFSET),%edi
movl $(COMMAND_LINE_SIZE/4),%ecx
rep
movsl
1:
/*
* Initialize page tables. This creates a PDE and a set of page
* tables, which are located immediately beyond _end. The variable
* init_pg_tables_end is set up to point to the first "safe" location.
* Mappings are created both at virtual address 0 (identity mapping)
* and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
*
* Warning: don't use %esi or the stack in this code. However, %esp
* can be used as a GPR if you really need it...
*/
page_pde_offset = (__PAGE_OFFSET >> 20);
movl $(pg0 - __PAGE_OFFSET), %edi
movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
movl $0x007, %eax /* 0x007 = PRESENT+RW+USER */
10:
leal 0x007(%edi),%ecx /* Create PDE entry */
movl %ecx,(%edx) /* Store identity PDE entry */
movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
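# the two stores above create the page-directory entries for virtual address 0 and for 3G; both point at the page table at pg0
addl $4,%edx   # advance to the next page-directory entry for when the loop comes back here; see the jb 10b below
movl $1024, %ecx   # first fill 1024 page-table entries: 1024*4K == 4M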
11:
stosl
addl $0x1000,%eax   # pages are 0x1000 == 4K each
loop 11b
/* End condition: we must map up to and including INIT_MAP_BEYOND_END */
/* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */
leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp
cmpl %ebp,%eax
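jb 10b   # make sure enough addresses are reachable once paging is on
The code above builds enough page tables and page-table entries. It may be a little hard to follow if you are not familiar with x86 paging.
See the "Provisional kernel Page Tables" section in chapter 2 of ULK3; it differs slightly but is still very helpful.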
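The loop above can be sketched in C, operating on plain arrays instead of physical memory. This is only an illustration under stated assumptions: build_boot_tables and its parameters are hypothetical names, tables_phys stands for the physical address of pg0, and beyond stands for INIT_MAP_BEYOND_END.

```c
#include <stdint.h>

#define PTE_ATTR    0x007u                 /* PRESENT + RW + USER          */
#define ENTRIES     1024u                  /* entries per table/directory  */
#define PAGE_BYTES  4096u
#define KERNEL_PDE  (0xC0000000u >> 22)    /* directory slot of 3G: 768    */

/*
 * pgdir:       page directory, 1024 entries (swapper_pg_dir in the text)
 * tables:      backing store for the page tables, max_tables * 1024 entries
 * tables_phys: physical address where the tables are assumed to live (pg0)
 * beyond:      how far past the end of the tables the mapping must reach
 * Returns the number of page tables that were filled in.
 */
static unsigned build_boot_tables(uint32_t *pgdir, uint32_t *tables,
                                  uint32_t tables_phys, uint32_t beyond,
                                  unsigned max_tables)
{
    uint32_t frame = 0;   /* next physical page to map, starts at 0 */
    unsigned t = 0;

    do {
        /* One directory entry per table, stored twice:
         * once for the identity mapping, once for PAGE_OFFSET. */
        uint32_t pde = (tables_phys + t * PAGE_BYTES) | PTE_ATTR;
        pgdir[t] = pde;
        pgdir[KERNEL_PDE + t] = pde;

        /* 1024 entries of 4K each cover 4M per table. */
        for (unsigned i = 0; i < ENTRIES; i++) {
            tables[t * ENTRIES + i] = frame | PTE_ATTR;
            frame += PAGE_BYTES;
        }
        t++;
        /* Continue while the mapped range does not yet reach
         * "end of the tables so far" + beyond. */
    } while (t < max_tables &&
             frame < tables_phys + t * PAGE_BYTES + beyond);

    return t;
}
```

The point of the two stores per iteration is that the same page table serves both address ranges, so the instruction that enables paging and the code that later runs at PAGE_OFFSET see the same physical memory.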
movl %edi,(init_pg_tables_end - __PAGE_OFFSET) #***
xorl %ebx,%ebx /* This is the boot CPU (BSP) */
jmp 3f
# some multiprocessor code removed here
3:
/*
* Enable paging
*/
movl $swapper_pg_dir-__PAGE_OFFSET,%eax
movl %eax,%cr3 /* set the page table pointer.. */
movl %cr0,%eax
orl $0x80000000,%eax
movl %eax,%cr0 /* ..and set paging (PG) bit */
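ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */   # a long jump
From here on paging is enabled, so "- __PAGE_OFFSET" is no longer needed: link-time addresses at PAGE_OFFSET and above now resolve through the page tables. The low identity-mapped addresses keep working as well, because the page-directory entries selected by the top 10 bits of both address ranges point to the same page tables.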
1:
/* Set up the stack pointer */
lss stack_start,%esp
/*
* Initialize eflags. Some BIOS's leave bits like NT set. This would
* confuse the debugger if this code is traced.
* XXX - best to initialize before switching to protected mode.
*/
pushl $0
popfl
# a few lines of multiprocessor code omitted here
/*
* start system 32-bit setup. We need to re-do some of the things done
* in 16-bit mode for the "real" operations.
*/
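call setup_idt   # installs the earliest handler for every interrupt; read setup_idt below, then come back
# some code omitted here: it checks the CPU type and parameters and includes some multiprocessor code
jmp start_kernel   # and so we reach the Modern Age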
/*
* setup_idt
*
* sets up a idt with 256 entries pointing to
* ignore_int, interrupt gates. It doesn't actually load
* idt - that can be done only after paging has been enabled
* and the kernel moved to PAGE_OFFSET. Interrupts
* are enabled elsewhere, when we can be relatively
* sure everything is ok.
*
* Warning: %esi is live across this function.
*/
setup_idt:
lea ignore_int,%edx
movl $(__KERNEL_CS << 16),%eax
movw %dx,%ax /* selector = 0x0010 = cs */
movw $0x8E00,%dx /* interrupt gate - dpl=0, present */
lea idt_table,%edi
mov $256,%ecx
rp_sidt:
movl %eax,(%edi)
movl %edx,4(%edi)
addl $8,%edi
dec %ecx
jne rp_sidt
# the code below replaces the handlers of a few interrupts/exceptions
.macro set_early_handler handler,trapno
lea \handler,%edx
movl $(__KERNEL_CS << 16),%eax
movw %dx,%ax
movw $0x8E00,%dx /* interrupt gate - dpl=0, present */
lea idt_table,%edi
movl %eax,8*\trapno(%edi)
movl %edx,8*\trapno+4(%edi)
.endm
set_early_handler handler=early_divide_err,trapno=0
set_early_handler handler=early_illegal_opcode,trapno=6
set_early_handler handler=early_protection_fault,trapno=13
set_early_handler handler=early_page_fault,trapno=14
ret
# the implementations of these handlers are omitted; they are simple and basically just halt the machine
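What setup_idt stores into each 8-byte IDT entry can be sketched in C. The struct and function names are hypothetical; the bit layout follows the assembly above (%eax holds selector and low handler half, %edx holds high handler half and 0x8E00). The selector value used in the usage note is only an example, not necessarily the kernel's __KERNEL_CS.

```c
#include <stdint.h>

/* The two 32-bit words written per IDT entry (names are illustrative). */
struct idt_gate {
    uint32_t lo;   /* what setup_idt keeps in %eax */
    uint32_t hi;   /* what setup_idt keeps in %edx */
};

static struct idt_gate make_interrupt_gate(uint16_t selector, uint32_t handler)
{
    struct idt_gate g;

    /* movl $(__KERNEL_CS << 16),%eax ; movw %dx,%ax */
    g.lo = ((uint32_t)selector << 16) | (handler & 0xFFFFu);

    /* movw $0x8E00,%dx : present, dpl=0, 32-bit interrupt gate;
     * the upper 16 bits of %edx still hold handler[31:16]. */
    g.hi = (handler & 0xFFFF0000u) | 0x8E00u;

    return g;
}
```

For example, make_interrupt_gate(0x60, 0xC0101234) yields lo = 0x00601234 and hi = 0xC0108E00, so the handler address is split across the two words with the selector and attribute bits in between.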
.section .text
/*
* Real beginning of normal "text" segment
*/
ENTRY(stext)
ENTRY(_stext)
# note the comment above; more on it later
/*
* BSS section
*/
.section ".bss.page_aligned","wa"
.align PAGE_SIZE_asm
ENTRY(swapper_pg_dir)
.fill 1024,4,0
# a little bit omitted here
/*
* This starts the data section.
*/
.data
ENTRY(stack_start)
.long init_thread_union+THREAD_SIZE
.long __BOOT_DS
ready: .byte 0
early_recursion_flag:
.long 0
int_msg:
.asciz "Unknown interrupt or fault at EIP %p %p %p\n"
fault_msg:
.ascii "Int %d: CR2 %p err %p EIP %p CS %p flags %p\n"
.asciz "Stack: %p %p %p %p %p %p %p %p\n"
#include "../xen/xen-head.S"
/*
* The IDT and GDT 'descriptors' are a strange 48-bit object
* only used by the lidt and lgdt instructions. They are not
* like usual segment descriptors - they consist of a 16-bit
* segment size, and 32-bit linear address value:
*/
.globl boot_gdt_descr
.globl idt_descr
ALIGN
# early boot GDT descriptor (must use 1:1 address mapping)
.word 0 # 32 bit align gdt_desc.address
boot_gdt_descr:
.word __BOOT_DS+7
.long boot_gdt - __PAGE_OFFSET
.word 0 # 32-bit align idt_desc.address
idt_descr:
.word IDT_ENTRIES*8-1 # idt contains 256 entries
.long idt_table
# boot GDT descriptor (later on used by CPU#0):
.word 0 # 32 bit align gdt_desc.address
ENTRY(early_gdt_descr)
.word GDT_ENTRIES*8-1
.long per_cpu__gdt_page /* Overwritten for secondary CPUs */
/*
* The boot_gdt must mirror the equivalent in setup.S and is
* used only for booting.
*/
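.align L1_CACHE_BYTES   # aligns the table to an L1 cache line boundary so that these frequently used bytes start on a cache line close to the CPU; L1 is the smallest and most precious of the cache levels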
ENTRY(boot_gdt)
.fill GDT_ENTRY_BOOT_CS,8,0
.quad 0x00cf9a000000ffff /* kernel 4GB code at 0x00000000 */
.quad 0x00cf92000000ffff /* kernel 4GB data at 0x00000000 */
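The two .quad values can be taken apart with a short C sketch that follows the standard x86 segment descriptor layout. The names seg_desc, decode_descriptor and segment_size_bytes are made up for this illustration.

```c
#include <stdint.h>

/* Decoded fields of an 8-byte x86 segment descriptor. */
struct seg_desc {
    uint32_t base;     /* 32-bit segment base                       */
    uint32_t limit;    /* raw 20-bit limit                          */
    uint8_t  access;   /* P, DPL, S and type bits                   */
    uint8_t  flags;    /* 4 bits: G (0x8), D/B (0x4), L, AVL        */
};

static struct seg_desc decode_descriptor(uint64_t d)
{
    struct seg_desc s;

    /* limit[15:0] in bits 0-15, limit[19:16] in bits 48-51 */
    s.limit  = (uint32_t)(d & 0xFFFFu) | (uint32_t)((d >> 32) & 0x000F0000u);
    /* base[23:0] in bits 16-39, base[31:24] in bits 56-63 */
    s.base   = (uint32_t)((d >> 16) & 0x00FFFFFFu) |
               (uint32_t)((d >> 32) & 0xFF000000u);
    s.access = (uint8_t)((d >> 40) & 0xFFu);
    s.flags  = (uint8_t)((d >> 52) & 0x0Fu);
    return s;
}

/* Size covered by the segment; with G set the limit counts 4K pages. */
static uint64_t segment_size_bytes(struct seg_desc s)
{
    uint64_t units = (uint64_t)s.limit + 1;
    return (s.flags & 0x8u) ? units << 12 : units;
}
```

Decoding 0x00cf9a000000ffff gives base 0, limit 0xFFFFF with the granularity bit set (4 GB), and access byte 0x9A (present, ring 0, readable code). The data entry differs only in its access byte, 0x92 (writable data).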
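V. The Modern Age: start_kernel
This is the first time we leave the arch/i386 directory: start_kernel is in linux/init/main.c, and apart from a few exceptions the code discussed from here on is in linux/init/.
After start_kernel there is much less architecture-specific code. The function contains so much that I have not really read it, so strictly speaking I should stop here. I will still try to cover what I do understand, up to the point where the init process is started, and skip the vast majority of the kernel initialization code in start_kernel.
As an aside, the code of this function is placed in the .init.text section (it is marked __init). Like a module initialization function it runs only once, and in the final stage the kernel reclaims the memory occupied by the init sections: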
void free_initmem(void)
{
free_init_pages("unused kernel memory",
(unsigned long)(&__init_begin),
(unsigned long)(&__init_end));
}
Back to start_kernel: after initializing the kernel, its last line of code is
/* Do the rest non-__init'ed, we're now alive */
rest_init();
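rest_init first executes kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);
to create a kernel thread, and finally calls cpu_idle(), which basically lets the machine save its strength and hands the CPU to whoever wants it, so I will not go into it. Following the flow, next comes the kernel_init function. For kernel threads, see chapter 3 of ULK3.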
static int __init kernel_init(void * unused)
{
/* a series of initializations */
do_basic_setup();
/*
* check if there is an early userspace init. If yes, let it do all
* the work
*/
if (!ramdisk_execute_command)
ramdisk_execute_command = "/init";
if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
ramdisk_execute_command = NULL;
prepare_namespace();
}
/*
* Ok, we have completed the initial bootup, and
* we're essentially up and running. Get rid of the
* initmem segments and start the user-mode stuff..
*/
init_post();
return 0;
}
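// end of kernel_init
After a series of initializations:
1. do_basic_setup is called
2. then the init process is executed
3. finally the kernel settles into waiting to serve requests.
So next, the do_basic_setup function.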
/*
* Ok, the machine is now initialized. None of the devices
* have been touched yet, but the CPU subsystem is up and
* running, and memory and process management works.
*
* Now we can finally start doing some real work..
*/
static void __init do_basic_setup(void)
{
/* drivers will send hotplug events */
init_workqueues();
usermodehelper_init();
driver_init();
init_irq_proc();
do_initcalls();
}
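I will not explain the first few functions it calls; let's look at the last one, do_initcalls().
First, the kernel has a dedicated section that holds the functions to be called towards the end of initialization.
For example, the last statement in init/initramfs.c is
rootfs_initcall(populate_rootfs);
where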
#define rootfs_initcall(fn) __define_initcall("rootfs",fn,rootfs)
#define __define_initcall(level,fn,id) \
static initcall_t __initcall_##fn##id __attribute_used__ \
__attribute__((__section__(".initcall" level ".init"))) = fn
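So a pointer to populate_rootfs ends up in the section .initcallrootfs.init.
At link time this section is emitted into the output section .initcall.init (to avoid too much jumping around, more on that later).
With the initcall background in place, back to do_initcalls().
In essence it is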
for (call = __initcall_start; call < __initcall_end; call++)
result = (*call)();
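This way every initcall gets invoked. It is a way to avoid gathering all the initialization code in one place, although it raises a question: who runs first? The kernel divides this big section into several classes, and the linker arranges which come first and which come later.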
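The start/end iteration can be tried out in user space with a small C sketch. As a stated simplification, an ordinary array stands in for the linker-built .initcall.init section, and the three functions and the order_log bookkeeping are invented for the example; in the kernel the table is assembled by the linker from the per-level input sections.

```c
#include <stddef.h>

typedef int (*initcall_t)(void);

/* Record the order in which the calls run (illustration only). */
static int order_log[8];
static size_t order_len;

static int early_setup(void)  { order_log[order_len++] = 1; return 0; }
static int rootfs_setup(void) { order_log[order_len++] = 2; return 0; }
static int late_setup(void)   { order_log[order_len++] = 3; return 0; }

/* Stand-in for the section the linker would build; the order of the
 * entries plays the role of the initcall levels. */
static initcall_t initcall_table[] = { early_setup, rootfs_setup, late_setup };

#define initcall_start (initcall_table)
#define initcall_end \
    (initcall_table + sizeof initcall_table / sizeof initcall_table[0])

/* Same shape as the loop in do_initcalls(): walk the pointers between
 * the start and end markers and call each one. Returns the number of
 * initcalls that reported a nonzero result. */
static int run_initcalls(void)
{
    int failures = 0;

    for (initcall_t *call = initcall_start; call < initcall_end; call++)
        if ((*call)() != 0)
            failures++;
    return failures;
}
```

The design benefit is that a subsystem registers itself by adding one pointer next to its own code, and the central loop never has to name it.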
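So we now know that populate_rootfs will run. It handles initramfs and initrd: it first calls unpack_to_rootfs and then checks whether there is an initrd. I will not list the code.
Back in kernel_init, the caller of do_basic_setup().
Next comes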
if (!ramdisk_execute_command)
ramdisk_execute_command = "/init";
if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
ramdisk_execute_command = NULL;
prepare_namespace();
}
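This code checks whether the root file system needs to be mounted. If vmlinuz carries an initramfs that already contains an init, nothing more is done (the target system I currently work on is like this; it has an init inside). Otherwise the kernel still has to mount the root file system on which init lives, which is also the initial root file system of every user-space process. Mounting the root file system and executing init are the last things the Linux boot process does.
I will skip the details of mount_root. prepare_namespace is well worth reading; go look at the source. If it fails, you get a panic along the lines of "VFS: Unable to mount root fs". Now assume the root file system is in place; that brings us to the last function call in kernel_init: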
/* This is a non __init function. Force it to be noinline otherwise gcc
* makes it inline to init() and it becomes part of init.text section
*/
static int noinline init_post(void)
{
free_initmem();
// this is the function mentioned above that frees the init code; obviously init_post itself cannot be __init
unlock_kernel();
mark_rodata_ro();
system_state = SYSTEM_RUNNING;
numa_default_policy();
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
printk(KERN_WARNING "Warning: unable to open an initial console.\n");
(void) sys_dup(0);
(void) sys_dup(0);
if (ramdisk_execute_command) {
run_init_process(ramdisk_execute_command);
printk(KERN_WARNING "Failed to execute %s\n",
ramdisk_execute_command);
}
/*
* We try each of these until one succeeds.
*
* The Bourne shell can be used instead of init if we are
* trying to recover a really broken machine.
*/
if (execute_command) {
run_init_process(execute_command);
printk(KERN_WARNING "Failed to execute %s. Attempting "
"defaults...\n", execute_command);
}
run_init_process("/sbin/init");
run_init_process("/etc/init");
run_init_process("/bin/init");
run_init_process("/bin/sh");
panic("No init found. Try passing init= option to kernel.");
}
This function essentially executes init, and panics if that fails.
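The order in which init_post tries the candidates can be sketched as a plain C function. This is a model, not kernel code: the kernel's run_init_process replaces the process image and only returns on failure, whereas here a caller-supplied try_exec callback (a hypothetical stand-in that returns 0 on success) makes the order observable. The helper names are invented.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for run_init_process(): 0 means "it ran". */
typedef int (*try_exec_fn)(const char *path);

/* Returns the first candidate that runs, or NULL where the kernel
 * would panic with "No init found". */
static const char *first_init_that_runs(const char *ramdisk_cmd,
                                        const char *user_cmd,
                                        try_exec_fn try_exec)
{
    static const char *const defaults[] = {
        "/sbin/init", "/etc/init", "/bin/init", "/bin/sh"
    };

    /* 1. /init from the initramfs, if one was found */
    if (ramdisk_cmd && try_exec(ramdisk_cmd) == 0)
        return ramdisk_cmd;
    /* 2. whatever init= on the command line asked for */
    if (user_cmd && try_exec(user_cmd) == 0)
        return user_cmd;
    /* 3. the built-in defaults, in order */
    for (size_t i = 0; i < sizeof defaults / sizeof defaults[0]; i++)
        if (try_exec(defaults[i]) == 0)
            return defaults[i];
    return NULL;
}

/* Two fake environments for trying the function out. */
static int only_bin_sh(const char *path)
{
    return strcmp(path, "/bin/sh") == 0 ? 0 : -1;
}

static int nothing_runs(const char *path)
{
    (void)path;
    return -1;
}
```

With only_bin_sh and init=/bin/bash, every earlier candidate fails and the result is "/bin/sh", which mirrors the recovery case mentioned in the kernel comment.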
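By the way, /dev/console ends up referenced by descriptors 0, 1 and 2, so it is the standard input, output and error of every process that does not reopen them.
At this point the kernel boot process is complete. init then sets up the user-space processes according to the configuration in the root file system and brings the system up.
VI. Linking
Linking is controlled by arch/i386/kernel/vmlinux.lds.S and arch/i386/boot/setup.ld.
setup.ld: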
/*
* setup.ld
*
* Linker script for the i386 setup code
*/
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)
SECTIONS
{
. = 0;
.bstext : { *(.bstext) }
# the code from header.S
.bsdata : { *(.bsdata) }
. = 497;
.header : { *(.header) }
# 497 plus the 15 bytes of .header is exactly 512 bytes
.inittext : { *(.inittext) }
# the code in header.S starting at _start
.initdata : { *(.initdata) }
.text : { *(.text*) }
# the C code in boot/
. = ALIGN(16);
.rodata : { *(.rodata*) }
.videocards : {
video_cards = .;
*(.videocards)
video_cards_end = .;
}
. = ALIGN(16);
.data : { *(.data*) }
.signature : {
setup_sig = .;
# remember the cmpl instruction in header.S?
LONG(0x5a5aaa55)
}
. = ALIGN(16);
.bss :
{
__bss_start = .;
*(.bss)
__bss_end = .;
}
# the bss-clearing code in setup uses __bss_start, together with _end defined just below
. = ALIGN(16);
_end = .;
/DISCARD/ : { *(.note*) }
. = ASSERT(_end <= 0x8000, "Setup too big!");
. = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");
# setup must not be too long: it has to stay within 32K
}
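The two numbers in this script, 497 and 0x1f1, are the same offset, and the .header section from header.S fills the rest of the first sector. The C sketch below checks that arithmetic with a packed struct that mirrors the fields listed under hdr: in header.S. The struct and function names are invented, __attribute__((packed)) is a GCC/Clang extension, and real_mode_bytes follows the boot.txt rule that setup_sects counts the setup sectors without the boot sector, with 0 meaning 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors setup_sects .. boot_flag from the .header section
 * (illustrative type, not a kernel definition). */
struct old_boot_header {
    uint8_t  setup_sects;   /* .byte SETUPSECTS   */
    uint16_t root_flags;    /* .word ROOT_RDONLY  */
    uint32_t syssize;       /* .long SYSSIZE      */
    uint16_t ram_size;      /* .word RAMDISK      */
    uint16_t vid_mode;      /* .word SVGA_MODE    */
    uint16_t root_dev;      /* .word ROOT_DEV     */
    uint16_t boot_flag;     /* .word 0xAA55       */
} __attribute__((packed));

#define HDR_OFFSET 0x1F1    /* ". = 497" in setup.ld */

/* Size of the real-mode part of the image: one boot sector plus
 * setup_sects setup sectors. */
static size_t real_mode_bytes(const uint8_t *image)
{
    unsigned sects = image[HDR_OFFSET];

    if (sects == 0)         /* boot.txt: a value of 0 means 4 */
        sects = 4;
    return (sects + 1u) * 512u;
}
```

The struct is 15 bytes, so HDR_OFFSET + 15 is 512 and boot_flag lands at 0x1FE, the usual place for the 0xAA55 boot signature.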
I will not reproduce all of vmlinux.lds.S. As an example, here is how the initcall section mentioned above is arranged:
.initcall.init : AT(ADDR(.initcall.init) - LOAD_OFFSET) {
__initcall_start = .;
INITCALLS
__initcall_end = .;
}
#define INITCALLS \
(many entries here)
\
*(.initcallrootfs.init) \
(many entries here)
The kernel has many other special sections, for example
#define __EXPORT_SYMBOL(sym, sec) \
extern typeof(sym) sym; \
__CRC_SYMBOL(sym, sec) \
static const char __kstrtab_##sym[] \
__attribute__((section("__ksymtab_strings"))) \
= MODULE_SYMBOL_PREFIX #sym; \
static const struct kernel_symbol __ksymtab_##sym \
__attribute_used__ \
__attribute__((section("__ksymtab" sec), unused)) \
= { (unsigned long)&sym, __kstrtab_##sym }
#define EXPORT_SYMBOL(sym) \
__EXPORT_SYMBOL(sym, "")
#define EXPORT_SYMBOL_GPL(sym) \
__EXPORT_SYMBOL(sym, "_gpl")
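These are used to export symbols. Unlike an ordinary executable produced by ld, the kernel has to relocate other files (modules) itself, so it needs this information. In the example above the _GPL symbols go into a separate section, so when the symbols of a module are being resolved and the module's code is not released under the GPL, the __ksymtab_gpl section is not searched.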
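The effect of keeping GPL-only symbols in their own table can be modelled with a short C sketch. Everything here is invented for illustration: the two arrays stand in for the __ksymtab and __ksymtab_gpl sections, the addresses are dummy values, and resolve_symbol is not the kernel's actual lookup routine.

```c
#include <stddef.h>
#include <string.h>

struct ksym {
    const char   *name;
    unsigned long value;    /* dummy addresses, for illustration only */
};

/* Stand-ins for the two sections filled by EXPORT_SYMBOL and
 * EXPORT_SYMBOL_GPL. */
static const struct ksym ksymtab[]     = { { "printk",          0x1000 } };
static const struct ksym ksymtab_gpl[] = { { "gpl_only_helper", 0x2000 } };

static const struct ksym *lookup(const struct ksym *tab, size_t n,
                                 const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return &tab[i];
    return NULL;
}

/* Resolve a symbol for a module being loaded; the GPL-only table is
 * searched only when the module declares a GPL-compatible license. */
static const struct ksym *resolve_symbol(const char *name, int module_is_gpl)
{
    const struct ksym *s = lookup(ksymtab, 1, name);

    if (s)
        return s;
    if (module_is_gpl)
        return lookup(ksymtab_gpl, 1, name);
    return NULL;
}
```

A non-GPL module that asks for gpl_only_helper gets NULL, while a GPL module gets the entry; ordinary exports resolve for both.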
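VII. References
These are the sources I consulted directly:
linux/Documentation/i386/boot.txt
linux/Documentation/initrd.txt
linux/Documentation/kbuild/makefile.txt
ld.pdf, gcc.pdf
ULK3, chapters 2, 3, 4, 9 and 12, appendices A and B
The Linux source code, and a few dozen pages of the Intel developer's manual, volume 3A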