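I. First, a look at /boot/vmlinuz-2.6.23
This is the kernel image. At boot time the bootloader loads it into memory and then executes it. Everybody knows that much, but the process is actually fairly involved. Linux can be started by many different bootloaders and runs on many architectures; the code I read is the i386 code, so this article talks about grub. There is a lot of complexity here: how grub manages to read a file, where it loads the kernel, and how it starts executing it. As for the first question: the first 512-byte block of grub (its boot sector) holds data that says which disk blocks contain the second-stage code, since 512 bytes are clearly not enough to understand a file system. That data is recorded when grub is installed. Once the second stage runs, grub can read the file system.
For now, simply assume grub can already read the file system.
grub first gets hold of vmlinuz (the short name used from here on). How does it handle it? Start with what vmlinuz is made of:
1. The first 512 bytes
2. A second piece of code, a modest number of 512-byte sectors (its exact size is discussed in a moment)
3. The protected-mode kernel code
The first 512 bytes are the traditional boot sector, the "Ancient Age" in ULK3's terms (it is a little special, though, because it is no longer actually used as a boot sector, as shown later). It used to live in arch/i386/boot/bootsect.S, but as the code shows, it has since been merged with part of the second piece into arch/i386/boot/header.S.
The second piece is the real-mode setup code, the "Middle Ages" in ULK3. As noted, in the current version part of setup is merged with the first piece in header.S, while the rest of setup comes from the other source files under boot/ (see boot/Makefile). The size of this piece is known once setup has been linked, and it is then written into a single byte at offset 0x01F1 of the first 512 bytes (which is also offset 0x01F1 of vmlinuz as a whole). Its value is the number of 512-byte sectors occupied by the setup code, not counting the boot sector (a stored value of 0 means 4). For example:
xuchm@debian:~/doc$ od -j 0x01F1 -N1 /boot/vmlinuz-2.6.23 -D
0000761 21
0000762
Here the value is 21, so the second piece is 21*512 == 10752 bytes (10.5K), and the two real-mode pieces together occupy (1+21)*512 == 11264 bytes.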
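The third piece is made up of all the rest of the kernel code. It runs after the switch to protected mode, so in contrast to setup it is called the protected-mode code.
One caveat: this description treats the kernel as if it were uncompressed, so the decompression step covered in ULK3 is left out. (In an actual compressed bzImage, the third piece begins with the decompressor from arch/i386/boot/compressed/, which unpacks the kernel and then jumps to the code described later.)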
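The size arithmetic for the real-mode part can be sketched in C. This is only an illustration, assuming the meaning of the byte at 0x01F1 given in Documentation/i386/boot.txt (setup sectors, not counting the boot sector, with 0 standing for 4). The function names are made up for this example and do not exist in the kernel or in grub.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE         512u
#define SETUP_SECTS_OFF     0x1F1u  /* offset of the setup_sects byte */
#define DEFAULT_SETUP_SECTS 4u      /* a stored 0 means 4 */

/* Number of setup sectors recorded in an in-memory copy of the image. */
unsigned setup_sectors(const uint8_t *image, size_t len)
{
    assert(len > SETUP_SECTS_OFF);
    unsigned n = image[SETUP_SECTS_OFF];
    return n ? n : DEFAULT_SETUP_SECTS;
}

/* Bytes of real-mode code: one boot sector plus the setup sectors. */
size_t real_mode_bytes(const uint8_t *image, size_t len)
{
    return (1u + setup_sectors(image, len)) * (size_t)SECTOR_SIZE;
}
```

For a buffer whose byte at 0x1F1 is 21, setup_sectors gives 21 and real_mode_bytes gives 11264; the protected-mode code would then start at that offset in the file.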
With the layout of the kernel image in mind, on to the loading process.
First, a passage from Documentation/i386/boot.txt:
For a modern bzImage kernel with boot protocol version >= 2.02, a
memory layout like the following is suggested:
~ ~
| Protected-mode kernel | # the protected-mode code described above; grub puts it here
100000 +------------------------+
| I/O memory hole |
0A0000 +------------------------+
| Reserved for BIOS | Leave as much as possible unused
~ ~
| Command line | (Can also be below the X+10000 mark)
X+10000 +------------------------+
| Stack/heap | For use by the kernel real-mode code.
X+08000 +------------------------+
| Kernel setup | The kernel real-mode code.
| Kernel boot sector | The kernel legacy boot sector.
# the two entries above are the code and data of the first two pieces of vmlinuz
X +------------------------+
| Boot loader |
001000 +------------------------+
| Reserved for MBR/BIOS |
000800 +------------------------+
| Typically used by MBR |
000600 +------------------------+
| BIOS use only |
000000 +------------------------+
... where the address X is as low as the design of the boot loader
permits.
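As the diagram shows, the first two pieces of vmlinuz sit together, and their location is not fixed; it depends on the size of grub itself. (I call both pieces real-mode code, even though the passage above uses that term only for setup.) The third piece goes at 0x100000; whether it really goes there is indicated by the byte at offset 0x0211 (loadflags), which comes up again below.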
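II. The assembly code: header.S
As described above, the first piece and part of the second piece live in boot/header.S, so they are read together here. For completeness the whole file is reproduced, which takes up a fair amount of space; the annotations mixed into the listing are mine.
Start again from loading. As described above, the protected-mode code of vmlinuz is loaded starting at 0x100000. The real-mode code (once more, meaning bootsect plus setup) does not have to be loaded at a fixed location. These are the entries quoted earlier:
Kernel setup | The kernel real-mode code.
Kernel boot sector
Their location depends on the size of grub.
So assume they are loaded at the logical address __LOAD_DS__:0000.
Now for the code: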
/*
* header.S
*
* Copyright (C) 1991, 1992 Linus Torvalds
*
* Based on bootsect.S and setup.S
* modified by more people than can be counted
*
* Rewritten as a common file by H. Peter Anvin (Apr 2007)
*
* BIG FAT NOTE: We’re in real mode using 64k segments. Therefore segment
* addresses must be multiplied by 16 to obtain their respective linear
* addresses. To avoid confusion, linear addresses are written using leading
* hex while segment addresses are written as segment:offset.
*
*/
#include
#include
#include
#include
#include
#include
#include "boot.h"
SETUPSECTS = 4 /* default nr of setup-sectors */
BOOTSEG = 0x07C0 /* original address of boot-sector */
SYSSEG = DEF_SYSSEG /* system loaded at 0x10000 (65536) */
SYSSIZE = DEF_SYSSIZE /* system size: # of 16-byte clicks */
/* to be loaded */
ROOT_DEV = 0 /* ROOT_DEV is now written by "build" */
SWAP_DEV = 0 /* SWAP_DEV is now written by "build" */
#ifndef SVGA_MODE
#define SVGA_MODE ASK_VGA
#endif
#ifndef RAMDISK
#define RAMDISK 0
#endif
#ifndef ROOT_RDONLY
#define ROOT_RDONLY 2
#endif
.code16
.section ".bstext", "ax" # note the name of this section
.global bootsect_start
bootsect_start:
# the bootsect code starts here, i.e. the source of the first 512 bytes of vmlinuz
# Normalize the start address
ljmp $BOOTSEG, $start2
# 0x07C0:0000
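If execution starts here, the sector was loaded directly by the BIOS. That is not supported, because Linux now requires a bootloader. This is what makes bootsect special, as mentioned above: it is not meant to be executed. If the BIOS does run it as a boot sector, it just prints a message asking the user to reboot.
This can be tried out in vmware:
dd if=/boot/vmlinuz-2.6.23 of=vm.img bs=512 count=1
Then boot with vm.img as the floppy image, and the message shown below appears.
All the following code does is print the message and wait for a reboot, so there is not much more to say about it.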
start2:
movw %cs, %ax
movw %ax, %ds
movw %ax, %es
movw %ax, %ss
xorw %sp, %sp
sti
cld
movw $bugger_off_msg, %si
msg_loop:
lodsb
andb %al, %al
jz bs_die
movb $0xe, %ah
movw $7, %bx
int $0x10
jmp msg_loop
bs_die:
# Allow the user to press a key, then reboot
xorw %ax, %ax
int $0x16
int $0x19
# int 0x19 should never return. In case it does anyway,
# invoke the BIOS reset code...
ljmp $0xf000,$0xfff0
.section ".bsdata", "a" # note the section
bugger_off_msg:
.ascii "Direct booting from floppy is no longer supported.\r\n"
.ascii "Please use a boot loader program instead.\r\n"
.ascii "\n"
.ascii "Remove disk and press any key to reboot . . .\r\n"
.byte 0
# Kernel attributes; used by setup. This is part 1 of the
# header, from the old boot sector.
# this is also part of the first 512 bytes of vmlinuz, but it is data
.section ".header", "a" # note the name of the section
.globl hdr
hdr:
setup_sects: .byte SETUPSECTS
root_flags: .word ROOT_RDONLY
syssize: .long SYSSIZE
ram_size: .word RAMDISK
vid_mode: .word SVGA_MODE
root_dev: .word ROOT_DEV
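boot_flag: .word 0xAA55 # looks familiar, right?
# The three sections defined above (.bstext, .bsdata and .header) together make up the first 512 bytes of vmlinuz described earlier. What follows is the "Middle Ages" code, which in the normal case is where the *first instruction* executed when the kernel takes over from the bootloader lives.
As noted above, grub loads the real-mode code at __LOAD_DS__:0000. How does grub then start executing it?
jmp_far(__LOAD_DS__+0x20, 0); /* Run the kernel */
Adding 0x20 to the segment moves the address forward by 0x200 bytes (in real mode, linear address = segment*16 + offset), so this jumps straight to the real-mode setup code of vmlinuz, skipping the 512-byte bootsect.
In grub this is really just one instruction, an ljmp.
So the flow of execution has reached the setup code, at the second 512-byte sector of vmlinuz, that is, at _start.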
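The segment arithmetic behind that jump can be checked with a few lines of C. This is a minimal sketch of real-mode address translation, not code from grub or the kernel; the function names and the load segment used in the note below are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode address translation: linear = segment * 16 + offset. */
uint32_t linear_addr(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/* Segment the bootloader jumps to: 0x20 paragraphs (0x200 bytes)
 * past the segment where the real-mode code was loaded. */
uint16_t setup_entry_seg(uint16_t load_ds)
{
    assert(load_ds <= 0xFFFFu - 0x20u);  /* no wrap-around */
    return (uint16_t)(load_ds + 0x20u);
}
```

With a hypothetical load segment of 0x9000, linear_addr(setup_entry_seg(0x9000), 0) and linear_addr(0x9000, 0x200) both give 0x90200, which is why jumping to (__LOAD_DS__+0x20):0 lands on the byte at offset 512 of the image.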
# offset 512, entry point
.globl _start
_start:
# Explicitly enter this as bytes, or the assembler
# tries to generate a 3-byte jump here, which causes
# everything else to push off to the wrong offset.
.byte 0xeb # short (2-byte) jump
.byte start_of_setup-1f
# The first instruction: a short jump over some data. That data is mentioned again later, so for now continue at start_of_setup:
1:
# Part 2 of the header, from the old setup.S
.ascii "HdrS" # header signature
.word 0x0206 # header version number (>= 0x0105)
# or else old loadlin-1.5 will fail)
.globl realmode_swtch
realmode_swtch: .word 0, 0 # default_switch, SETUPSEG
start_sys_seg: .word SYSSEG
.word kernel_version-512 # pointing to kernel version string
# above section of header is compatible
# with loadlin-1.5 (header v1.5). Don’t
# change it.
type_of_loader: .byte 0 # = 0, old one (LILO, Loadlin,
# Bootlin, SYSLX, bootsect...)
# See Documentation/i386/boot.txt for
# assigned ids
# flags, unused bits must be zero (RFU) bit within loadflags
loadflags:
LOADED_HIGH = 1 # If set, the kernel is loaded high
CAN_USE_HEAP = 0x80 # If set, the loader also has set
# heap_end_ptr to tell how much
# space behind setup.S can be used for
# heap purposes.
# Only the loader knows what is free
#ifndef __BIG_KERNEL__
.byte 0
#else
.byte LOADED_HIGH
#endif
setup_move_size: .word 0x8000 # size to move, when setup is not
# loaded at 0x90000. We will move setup
# to 0x90000 then just before jumping
# into the kernel. However, only the
# loader knows how much data behind
# us also needs to be loaded.
code32_start: # here loaders can put a different
# start address for 32-bit code.
#ifndef __BIG_KERNEL__
.long 0x1000 # 0x1000 = default for zImage
#else
.long 0x100000 # 0x100000 = default for big kernel
# the 32-bit entry point: the protected-mode code is loaded at, and entered at, 0x100000 (the 1M mark)
#endif
ramdisk_image: .long 0 # address of loaded ramdisk image
# Here the loader puts the 32-bit
# address where it loaded the image.
# This only will be read by the kernel.
ramdisk_size: .long 0 # its size in bytes
bootsect_kludge:
.long 0 # obsolete
heap_end_ptr: .word _end+1024 # (Header version 0x0201 or later)
# space from here (exclusive) down to
# end of setup code can be used by setup
# for local heap purposes.
pad1: .word 0
cmd_line_ptr: .long 0 # (Header version 0x0202 or later)
# If nonzero, a 32-bit pointer
# to the kernel command line.
# The command line should be
# located between the start of
# setup and the end of low
# memory (0xa0000), or it may
# get overwritten before it
# gets read. If this field is
# used, there is no longer
# anything magical about the
# 0x90000 segment; the setup
# can be located anywhere in
# low memory 0x10000 or higher.
ramdisk_max: .long (-__PAGE_OFFSET-(512
# (Header version 0x0203 or later)
# The highest safe address for
# the contents of an initrd
kernel_alignment: .long CONFIG_PHYSICAL_ALIGN #physical addr alignment
#required for protected mode
#kernel
#ifdef CONFIG_RELOCATABLE
relocatable_kernel: .byte 1
#else
relocatable_kernel: .byte 0
#endif
pad2: .byte 0
pad3: .word 0
cmdline_size: .long COMMAND_LINE_SIZE-1 #length of the command line,
#added with boot protocol
#version 2.06
# End of setup header #####################################################
.section ".inittext", "ax"
# execution arrives here ************
start_of_setup:
#ifdef SAFE_RESET_DISK_CONTROLLER
# Reset the disk controller.
movw $0x0000, %ax # Reset disk controller
movb $0x80, %dl # All disks
int $0x13
#endif
# We will have entered with %cs = %ds+0x20, normalize %cs so
# it is on par with the other segments.
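As described above, grub gets here with an ljmp, and at that point the data segment is set to __LOAD_DS__, so after the jump %cs holds __LOAD_DS__ + 0x20. The code below resets %cs to __LOAD_DS__. (Otherwise the instruction offsets would be wrong: the setup code is linked to start 512 bytes into the image, so with the larger %cs every cs:ip reference would land 0x200 bytes too far.)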
pushw %ds
pushw $setup2
lretw
setup2:
# Force %es = %ds
movw %ds, %ax
movw %ax, %es
cld
# Stack paranoia: align the stack and make sure it is good
# for both 16- and 32-bit references. In particular, if we
# were meant to have been using the full 16-bit segment, the
# caller might have set %sp to zero, which breaks %esp-based
# references.
andw $~3, %sp # dword align (might as well...)
jnz 1f
movw $0xfffc, %sp # Make sure we’re not zero
1: movzwl %sp, %esp # Clear upper half of %esp
sti
# Check signature at end of setup
cmpl $0x5a5aaa55, setup_sig
jne setup_bad
The instructions above set up the stack. For an intact setup image the cmpl always matches.
# Zero the bss
movw $__bss_start, %di
movw $_end+3, %cx
xorl %eax, %eax
subw %di, %cx
shrw $2, %cx
rep; stosl
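The code above zeroes the bss section of setup. The protected-mode kernel code shown later uses symbols with similar names; they are not the same symbols, but they serve the same purpose, hence the same names.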
# Jump to C code (should not return)
calll main # off to the C code
The code is in boot/main.c.
# Setup corrupt somehow...
setup_bad:
movl $setup_corrupt, %eax
calll puts
# Fall through...
.globl die
.type die, @function
die:
hlt
jmp die
.size die, .-die
.section ".initdata", "a"
setup_corrupt:
.byte 7
.string "No setup signature found...\n"
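III. The C code: boot/main.c
header.S ends by jumping to main, i.e. void main() in boot/main.c.
I have not read most of this function. For the details, see the section on the setup() function in Appendix A of ULK3. Look at the last line of main: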
/* Do the last things and invoke protected mode */
go_to_protected_mode();
which in turn calls
protected_mode_jump(boot_params.hdr.code32_start,(u32)&boot_params + (ds()
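This function takes two arguments. The first is the address of the first protected-mode instruction, which as noted above is 0x100000. The second is the parameter block handed to the kernel. Because the CPU is about to switch to protected mode, it has to be passed as a linear address rather than as an offset within the segment; ds() returns the value of the %ds register, from which the segment base (%ds * 16) is obtained.
Now look at the function itself, in pmjump.S: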
/*
* void protected_mode_jump(u32 entrypoint, u32 bootparams);
*/
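First, note that this function receives its arguments in registers (the first in %eax, the second in %edx, as the code below shows); see CFLAGS in boot/Makefile.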
protected_mode_jump:
xorl %ebx, %ebx # Flag to indicate this is a boot
movl %edx, %esi # Pointer to boot_params table; boot_params is kept in %esi
movl %eax, 2f # Patch ljmpl instruction; see the code at 2: below
jmp 1f # Short jump to flush instruction q.
1:
movw $__BOOT_DS, %cx
movl %cr0, %edx
orb $1, %dl # Protected mode (PE) bit
movl %edx, %cr0
movw %cx, %ds
movw %cx, %es
movw %cx, %fs
movw %cx, %gs
movw %cx, %ss
# Jump to the 32-bit entrypoint
.byte 0x66, 0xea # ljmpl opcode
2: .long 0 # offset; by now patched to 0x100000. The CPU is already in protected mode here, so __BOOT_CS is a selector into the GDT
.word __BOOT_CS # segment
.size protected_mode_jump, .-protected_mode_jump
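By the end of main, then, interrupts have been disabled, the initial gdt and idt have been set up, and the CPU has entered protected mode.
Execution has also moved to linear address 0x100000, the protected-mode code mentioned at the start of this article. That code is in arch/i386/kernel/head.S, and this is how the startup_32 function of ULK3 is reached: the Renaissance.
(The segment base is 0; see setup_gdt(), called from go_to_protected_mode(). So the linear address equals the offset, and since paging is not enabled yet, the linear address is also the physical address. The physical address just above 1M is exactly where the protected-mode code sits.)
I do not want to go too deep into protected mode, so that code is not listed here, and a lot of related background is skipped as well. Reading the code is easy but explaining it is hard, and almost any article on the subject will be clearer than what I could write.
So now the CPU is in protected mode.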
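IV. arch/i386/kernel/head.S
As described above, execution has reached the protected-mode code, and the first code to run is head.S. head.S is fairly long, more than 500 lines, so unlike above I do not list all of it. Code that does not affect the overall picture of head.S is removed, as are comments that do not help much. Whatever is listed, however, appears in the same order as in the source so that it is easy to look up, and omissions are marked.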
.text
#include a lot of .h
...
/*
* References to members of the new_cpu_data structure.
*/
#define X86 new_cpu_data+CPUINFO_x86
#define X86_VENDOR new_cpu_data+CPUINFO_x86_vendor
#define X86_MODEL new_cpu_data+CPUINFO_x86_model
#define X86_MASK new_cpu_data+CPUINFO_x86_mask
#define X86_HARD_MATH new_cpu_data+CPUINFO_hard_math
#define X86_CPUID new_cpu_data+CPUINFO_cpuid_level
#define X86_CAPABILITY new_cpu_data+CPUINFO_x86_capability
#define X86_VENDOR_ID new_cpu_data+CPUINFO_x86_vendor_id
... (some comments describing the macros below)
LOW_PAGES = 1
#if PTRS_PER_PMD > 1
PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PMD) + PTRS_PER_PGD
#else
PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PGD)
#endif
BOOTBITMAP_SIZE = LOW_PAGES / 8
ALLOCATOR_SLOP = 4
INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE + (PAGE_TABLE_SIZE + ALLOCATOR_SLOP)*PAGE_SIZE_asm
/*
* 32-bit kernel entrypoint; only used by the boot CPU. On entry,
* %esi points to the real-mode code as a 32-bit pointer.
* CS and DS must be 4 GB flat segments, but we don’t depend on
* any particular GDT layout, because we load our own as soon as we
* can.
*/
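# setup_gdt in setup exists only to support the ljmp that enters protected mode
.section .text.head,"ax",@progbits # the code goes into the .text.head section
ENTRY(startup_32) # the first protected-mode instruction starts here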
/*
* Set segments to known values.
*/
cld
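lgdt boot_gdt_descr - __PAGE_OFFSET # __PAGE_OFFSET is the well-known 3G, 0xC0000000
Several places below subtract __PAGE_OFFSET. To reference a variable the code needs its physical address, and for now the linear address is the physical address because paging is off. But symbol values are all link-time addresses: the offset within the protected-mode image + 0xC0000000 + 0x100000, because the kernel will eventually run with paging and is linked relative to that base (more on this later). Without the subtraction, boot_gdt_descr for example would evaluate to 0xC0100000+n, where n is a small number, the offset of boot_gdt_descr from the start of the protected-mode code in vmlinuz. After the subtraction, boot_gdt_descr - __PAGE_OFFSET is 0x100000+n, which is exactly the physical address of boot_gdt_descr.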
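The same calculation, written out as a small C sketch. The two constants are the values used in the text (3G and 1M); the function names and the sample offset in the note are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000u  /* 3G, as in the text */
#define LOAD_PHYS   0x00100000u  /* protected-mode code sits at 1M */

/* Link-time address of something n bytes into the protected-mode image. */
uint32_t link_addr(uint32_t n)
{
    return PAGE_OFFSET + LOAD_PHYS + n;
}

/* What "symbol - __PAGE_OFFSET" evaluates to while paging is still off. */
uint32_t early_phys(uint32_t vaddr)
{
    assert(vaddr >= PAGE_OFFSET);
    return vaddr - PAGE_OFFSET;
}
```

For a symbol 0x40 bytes into the image, link_addr(0x40) is 0xC0100040 and early_phys of that value is 0x00100040, the address where the bytes actually are in memory.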
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
/*
* Clear BSS first so that there are no surprises...
* No need to cld as DF is already clear from cld above...
*/
xorl %eax,%eax
movl $__bss_start - __PAGE_OFFSET,%edi # do not confuse this with the one in setup
movl $__bss_stop - __PAGE_OFFSET,%ecx #
subl %edi,%ecx
shrl $2,%ecx
rep ; stosl
/*
* Copy bootup parameters out of the way.
* Note: %esi still has the pointer to the real-mode data.
* With the kexec as boot loader, parameter segment might be loaded beyond
* kernel image and might not even be addressable by early boot page tables.
* (kexec on panic case). Hence copy out the parameters before initializing
* page tables.
*/
# As seen above, protected_mode_jump put the address of setup's boot_params into %esi
Copy it. Once again, the boot_params variable here is not the same one as in setup.
movl $(boot_params - __PAGE_OFFSET),%edi
movl $(PARAM_SIZE/4),%ecx
cld
rep
movsl
movl boot_params - __PAGE_OFFSET + NEW_CL_POINTER,%esi
andl %esi,%esi
jnz 2f # New command line protocol
cmpw $(OLD_CL_MAGIC),OLD_CL_MAGIC_ADDR
jne 1f
movzwl OLD_CL_OFFSET,%esi
addl $(OLD_CL_BASE_ADDR),%esi
2:
# handles the command-line parameters, e.g. init=/bin/bash, console=ttyS0
movl $(boot_command_line - __PAGE_OFFSET),%edi
movl $(COMMAND_LINE_SIZE/4),%ecx
rep
movsl
1:
/*
* Initialize page tables. This creates a PDE and a set of page
* tables, which are located immediately beyond _end. The variable
* init_pg_tables_end is set up to point to the first "safe" location.
* Mappings are created both at virtual address 0 (identity mapping)
* and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
*
* Warning: don’t use %esi or the stack in this code. However, %esp
* can be used as a GPR if you really need it...
*/
page_pde_offset = (__PAGE_OFFSET >> 20);
movl $(pg0 - __PAGE_OFFSET), %edi
movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
movl $0x007, %eax /* 0x007 = PRESENT+RW+USER */
10:
leal 0x007(%edi),%ecx /* Create PDE entry */
movl %ecx,(%edx) /* Store identity PDE entry */
movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
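# the two stores above create the page directory entries for 0 and for 3G; both point at the same page table (pg0 on the first pass)
addl $4,%edx # advance to the next page directory entry for the next pass through here; see the jb 10b below
movl $1024, %ecx # fill 1024 page table entries: 1024*4K == 4M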
11:
stosl
addl $0x1000,%eax # pages are 0x1000 == 4K for now
loop 11b
/* End condition: we must map up to and including INIT_MAP_BEYOND_END */
/* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */
leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp
cmpl %ebp,%eax
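jb 10b # make sure enough address space is reachable once paging is on
The code above builds enough page tables and page table entries. It may be a little hard to follow without some familiarity with x86 paging.
The section "Provisional kernel Page Tables" in Chapter 2 of ULK3 is a helpful reference, even though the code there differs somewhat.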
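The structure of that loop may be easier to see in C. The sketch below is not the kernel's code: it is a simplified user-space model, assuming non-PAE 4K paging with 1024-entry tables, and all function names are invented for the example. It shows the two ideas in the assembly: each page table maps 4M of consecutive physical pages, and every table is entered twice in the page directory, once for the identity mapping and once at PAGE_OFFSET. Note that page_pde_offset in the listing is a byte offset (0xC0000000 >> 20 == 0xC00), which corresponds to entry index 768.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET  0xC0000000u
#define PAGE_SIZE    0x1000u
#define PTRS_PER_TBL 1024u
#define ATTR         0x007u   /* PRESENT+RW+USER, as in the listing */

/* Index of the page directory entry that covers a virtual address. */
uint32_t pde_index(uint32_t vaddr)
{
    return vaddr >> 22;       /* top 10 bits */
}

/* Fill one page table so that it maps 4M starting at first_phys. */
void fill_page_table(uint32_t *table, uint32_t first_phys)
{
    assert((first_phys & (PAGE_SIZE - 1)) == 0);
    for (uint32_t i = 0; i < PTRS_PER_TBL; i++)
        table[i] = (first_phys + i * PAGE_SIZE) | ATTR;
}

/* Enter one page table into the directory twice. */
void map_twice(uint32_t *pgdir, uint32_t slot, uint32_t table_phys)
{
    assert(slot < PTRS_PER_TBL - pde_index(PAGE_OFFSET));
    uint32_t pde = table_phys | ATTR;
    pgdir[slot] = pde;                            /* virtual 0  + slot*4M */
    pgdir[pde_index(PAGE_OFFSET) + slot] = pde;   /* virtual 3G + slot*4M */
}
```

After fill_page_table(table, 0) and map_twice(pgdir, 0, table_phys), entries 0 and 768 of pgdir are identical, so virtual addresses 0..4M and 0xC0000000..0xC0400000 both resolve to physical 0..4M.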
movl %edi,(init_pg_tables_end - __PAGE_OFFSET) #***
xorl %ebx,%ebx /* This is the boot CPU (BSP) */
jmp 3f
# some multiprocessor code removed here
3:
/*
* Enable paging
*/
movl $swapper_pg_dir-__PAGE_OFFSET,%eax
movl %eax,%cr3 /* set the page table pointer.. */
movl %cr0,%eax
orl $0x80000000,%eax
movl %eax,%cr0 /* ..and set paging (PG) bit */
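ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */ # a far jump
From here on paging is enabled, so there is no more need to subtract __PAGE_OFFSET: the page directory entries for virtual address 0 and for __PAGE_OFFSET point at the same page tables, so both ranges map to the same physical memory.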
1:
/* Set up the stack pointer */
lss stack_start,%esp
/*
* Initialize eflags. Some BIOS’s leave bits like NT set. This would
* confuse the debugger if this code is traced.
* XXX - best to initialize before switching to protected mode.
*/
pushl $0
popfl
(a few lines of multiprocessor code omitted here)
/*
* start system 32-bit setup. We need to re-do some of the things done
* in 16-bit mode for the "real" operations.
*/
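call setup_idt # installs the earliest handler for every interrupt; see the code further down, then come back
(some code omitted here: it checks the CPU type and parameters, and includes some multiprocessor code)
jmp start_kernel # and so to the Modern Age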
/*
* setup_idt
*
* sets up a idt with 256 entries pointing to
* ignore_int, interrupt gates. It doesn’t actually load
* idt - that can be done only after paging has been enabled
* and the kernel moved to PAGE_OFFSET. Interrupts
* are enabled elsewhere, when we can be relatively
* sure everything is ok.
*
* Warning: %esi is live across this function.
*/
setup_idt:
lea ignore_int,%edx
movl $(__KERNEL_CS
movw %dx,%ax /* selector = 0x0010 = cs */
movw $0x8E00,%dx /* interrupt gate - dpl=0, present */
lea idt_table,%edi
mov $256,%ecx
rp_sidt:
movl %eax,(%edi)
movl %edx,4(%edi)
addl $8,%edi
dec %ecx
jne rp_sidt
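The code above points every IDT entry at ignore_int, with the type set to interrupt gate. For details see Chapter 5 of Volume 3 of the Intel software developer's manual.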
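How the two 32-bit words of each entry are put together can be sketched in C. This mirrors what the assembly does with %eax and %edx (selector in the upper half of the low word, 0x8E00 in the lower half of the high word), but the struct and function names are made up for illustration, and the handler address and selector in the note are arbitrary example values.

```c
#include <assert.h>
#include <stdint.h>

struct idt_gate {
    uint32_t lo;  /* selector in bits 31..16, handler offset bits 15..0 */
    uint32_t hi;  /* handler offset bits 31..16, type/attributes in 15..0 */
};

#define GATE_INTERRUPT 0x8E00u  /* interrupt gate, dpl=0, present */

struct idt_gate make_gate(uint32_t handler, uint16_t selector)
{
    struct idt_gate g;
    g.lo = ((uint32_t)selector << 16) | (handler & 0xFFFFu);
    g.hi = (handler & 0xFFFF0000u) | GATE_INTERRUPT;
    return g;
}

/* Point the first `count` entries at one handler, as the loop above does. */
void fill_idt(struct idt_gate *idt, unsigned count,
              uint32_t handler, uint16_t selector)
{
    assert(count <= 256);
    for (unsigned i = 0; i < count; i++)
        idt[i] = make_gate(handler, selector);
}
```

For example, make_gate(0xC0101234, 0x0060) yields lo == 0x00601234 and hi == 0xC0108E00.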
The code below then replaces a few of the interrupt/exception handlers:
.macro set_early_handler handler,trapno
lea \handler,%edx
movl $(__KERNEL_CS
movw %dx,%ax
movw $0x8E00,%dx /* interrupt gate - dpl=0, present */
lea idt_table,%edi
movl %eax,8*\trapno(%edi)
movl %edx,8*\trapno+4(%edi)
.endm
set_early_handler handler=early_divide_err,trapno=0
set_early_handler handler=early_illegal_opcode,trapno=6
set_early_handler handler=early_protection_fault,trapno=13
set_early_handler handler=early_page_fault,trapno=14
ret
(the implementations of these handlers are omitted here; they are simple and basically just halt the machine)
.section .text
/*
* Real beginning of normal "text" segment
*/
ENTRY(stext)
ENTRY(_stext)
# keep the comment above in mind; more on it later
/*
* BSS section
*/
.section ".bss.page_aligned","wa"
.align PAGE_SIZE_asm
ENTRY(swapper_pg_dir)
.fill 1024,4,0
(a little omitted here)
/*
* This starts the data section.
*/
.data
ENTRY(stack_start)
.long init_thread_union+THREAD_SIZE
.long __BOOT_DS
ready: .byte 0
early_recursion_flag:
.long 0
int_msg:
.asciz "Unknown interrupt or fault at EIP %p %p %p\n"
fault_msg:
.ascii "Int %d: CR2 %p err %p EIP %p CS %p flags %p\n"
.asciz "Stack: %p %p %p %p %p %p %p %p\n"
#include "../xen/xen-head.S"
/*
* The IDT and GDT ’descriptors’ are a strange 48-bit object
* only used by the lidt and lgdt instructions. They are not
* like usual segment descriptors - they consist of a 16-bit
* segment size, and 32-bit linear address value:
*/
.globl boot_gdt_descr
.globl idt_descr
ALIGN
# early boot GDT descriptor (must use 1:1 address mapping)
.word 0 # 32 bit align gdt_desc.address
boot_gdt_descr:
.word __BOOT_DS+7
.long boot_gdt - __PAGE_OFFSET
.word 0 # 32-bit align idt_desc.address
idt_descr:
.word IDT_ENTRIES*8-1 # idt contains 256 entries
.long idt_table
# boot GDT descriptor (later on used by CPU#0):
.word 0 # 32 bit align gdt_desc.address
ENTRY(early_gdt_descr)
.word GDT_ENTRIES*8-1
.long per_cpu__gdt_page /* Overwritten for secondary CPUs */
/*
* The boot_gdt must mirror the equivalent in setup.S and is
* used only for booting.
*/
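.align L1_CACHE_BYTES # aligns the table to an L1 cache line boundary, so that the bytes below share a line in the cache closest to the CPU; the idea is that the table is accessed often and L1 is the smallest and most precious cache level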
ENTRY(boot_gdt)
.fill GDT_ENTRY_BOOT_CS,8,0
.quad 0x00cf9a000000ffff /* kernel 4GB code at 0x00000000 */
.quad 0x00cf92000000ffff /* kernel 4GB data at 0x00000000 */
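The two .quad values are ordinary x86 segment descriptors, and a short C sketch can decode them. The field positions follow the standard descriptor layout from the Intel manual; the function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Base address: descriptor bits 16..39 and 56..63. */
uint32_t desc_base(uint64_t d)
{
    return (uint32_t)(((d >> 16) & 0xFFFFFFu) | (((d >> 56) & 0xFFu) << 24));
}

/* 20-bit limit: descriptor bits 0..15 and 48..51. */
uint32_t desc_limit(uint64_t d)
{
    return (uint32_t)((d & 0xFFFFu) | (((d >> 48) & 0xFu) << 16));
}

/* Segment size in bytes; the G bit (bit 55) selects 4K granularity. */
uint64_t desc_size_bytes(uint64_t d)
{
    uint64_t units = (uint64_t)desc_limit(d) + 1;
    assert(units <= 0x100000u);  /* the limit field has only 20 bits */
    return ((d >> 55) & 1) ? units * 4096u : units;
}

/* Bit 43 (bit 3 of the type field) distinguishes code from data. */
int desc_is_code(uint64_t d)
{
    return (int)((d >> 43) & 1);
}
```

Both 0x00cf9a000000ffff and 0x00cf92000000ffff decode to base 0, limit 0xFFFFF and a size of 4G; only the first has the code bit set, which matches the "4GB code" and "4GB data" comments in the listing.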
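V. The Modern Age: start_kernel
This is the first time execution leaves the arch/i386 directory: start_kernel is in linux/init/main.c, and apart from a few exceptions the code discussed below is all under linux/init/.
After start_kernel there is much less architecture-specific code. The function does far too much and I have not read most of it, so strictly speaking this is where I should stop. I will still try to cover the parts I understand, up to the start of the init process, skipping the vast majority of the kernel initialization done in start_kernel.
As an aside, the code of this function is placed in the .init.text section. Like a module's init function, it runs only once, and in the final stage of booting the kernel reclaims the memory occupied by the init sections: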
void free_initmem(void)
{
free_init_pages("unused kernel memory",
(unsigned long)(&__init_begin),
(unsigned long)(&__init_end));
}
Back to start_kernel. After the kernel initialization, its last line is
/* Do the rest non-__init’ed, we’re now alive */
rest_init();
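The first thing rest_init does is kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);
which creates a kernel thread. It finally calls cpu_idle(), which basically lets the machine rest and hands the CPU to whoever needs it, so nothing more about it here. Following the flow, kernel_init comes next. For kernel threads, see Chapter 3 of ULK3.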
static int __init kernel_init(void * unused)
{
/* a series of initializations */
do_basic_setup();
/*
* check if there is an early userspace init. If yes, let it do all
* the work
*/
if (!ramdisk_execute_command)
ramdisk_execute_command = "/init";
if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
ramdisk_execute_command = NULL;
prepare_namespace();
}
/*
* Ok, we have completed the initial bootup, and
* we’re essentially up and running. Get rid of the
* initmem segments and start the user-mode stuff..
*/
init_post();
return 0;
}
//endof kernel_init
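After a series of initializations, kernel_init:
1. calls do_basic_setup
2. then starts the init process
3. after which the kernel just keeps waiting to service requests.
So do_basic_setup is next.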
/*
* Ok, the machine is now initialized. None of the devices
* have been touched yet, but the CPU subsystem is up and
* running, and memory and process management works.
*
* Now we can finally start doing some real work..
*/
static void __init do_basic_setup(void)
{
/* drivers will send hotplug events */
init_workqueues();
usermodehelper_init();
driver_init();
init_irq_proc();
do_initcalls();
}
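I will not explain the first few functions it calls; the interesting one is the last, do_initcalls().
First, the kernel has a dedicated section that holds functions to be called at the end of initialization.
For example, the last statement in init/initramfs.c is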
rootfs_initcall(populate_rootfs);
where
#define rootfs_initcall(fn) __define_initcall("rootfs",fn,rootfs)
#define __define_initcall(level,fn,id) \
static initcall_t __initcall_##fn##id __attribute_used__ \
__attribute__((__section__(".initcall" level ".init"))) = fn
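So populate_rootfs (more precisely, a pointer to it) ends up in .initcallrootfs.init.
At link time that section is emitted into the output section .initcall.init (details are deferred to avoid jumping around too much).
With initcalls explained, back to do_initcalls().
In essence it is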
for (call = __initcall_start; call
result = (*call)();
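That is how every initcall gets called. It is a way to avoid gathering all initialization code in one place. It does raise a question: who runs first? The kernel divides this big section into classes, and the linker script decides which come earlier and which later.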
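The pattern can be imitated in ordinary user-space C. The sketch below is an assumption-laden stand-in, not the kernel's do_initcalls(): a plain array plays the role of the table that the linker assembles between __initcall_start and __initcall_end, and init_a and init_b are hypothetical initialization functions.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*initcall_t)(void);

static int calls_run;

/* Hypothetical init functions; each only records that it ran. */
static int init_a(void) { calls_run++; return 0; }
static int init_b(void) { calls_run++; return 0; }

/* Stand-in for the linker-built table of function pointers. */
static initcall_t initcall_table[] = { init_a, init_b };

/* Walk the table front to back and call every entry;
 * returns how many calls reported a nonzero result. */
int run_initcalls(const initcall_t *start, const initcall_t *end)
{
    int failures = 0;
    assert(start <= end);
    for (const initcall_t *call = start; call < end; call++)
        if ((*call)() != 0)
            failures++;
    return failures;
}
```

Calling run_initcalls(initcall_table, initcall_table + 2) runs both functions in table order. In the kernel the order of entries is fixed by how the linker script arranges the input sections, which is what makes the "who runs first" question a link-time matter.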
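Conveniently, this shows that populate_rootfs gets executed. It handles initramfs and initrd: it first runs unpack_to_rootfs and then checks whether there is an initrd. The code is not listed here.
Back to kernel_init, the caller of do_basic_setup().
Next comes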
if (!ramdisk_execute_command)
ramdisk_execute_command = "/init";
if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
ramdisk_execute_command = NULL;
prepare_namespace();
}
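This code checks whether the root file system still needs to be mounted. If vmlinuz carries an initramfs that already contains an init, then it does not (the target system I currently work on is like this, with an init inside). Otherwise the kernel has to mount the root file system that holds init, which is also the initial root file system of every user-space process. Mounting the root file system and executing init are the last things the Linux boot process does.
I will skip the details of mount_root, but prepare_namespace is well worth reading in the source. If it fails, the result is a panic with a VFS error saying that no root file system could be mounted. Assuming the root file system is now in place, that leaves the last function call in kernel_init: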
/* This is a non __init function. Force it to be noinline otherwise gcc
* makes it inline to init() and it becomes part of init.text section
*/
static int noinline init_post(void)
{
free_initmem(); // the function mentioned above that frees the init code; this is clearly why init_post cannot be __init
unlock_kernel();
mark_rodata_ro();
system_state = SYSTEM_RUNNING;
numa_default_policy();
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0)
printk(KERN_WARNING "Warning: unable to open an initial console.\n");
(void) sys_dup(0);
(void) sys_dup(0);
if (ramdisk_execute_command) {
run_init_process(ramdisk_execute_command);
printk(KERN_WARNING "Failed to execute %s\n",
ramdisk_execute_command);
}
/*
* We try each of these until one succeeds.
*
* The Bourne shell can be used instead of init if we are
* trying to recover a really broken machine.
*/
if (execute_command) {
run_init_process(execute_command);
printk(KERN_WARNING "Failed to execute %s. Attempting "
"defaults...\n", execute_command);
}
run_init_process("/sbin/init");
run_init_process("/etc/init");
run_init_process("/bin/init");
run_init_process("/bin/sh");
panic("No init found. Try passing init= option to kernel.");
}
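This function basically just executes init, and panics if that fails.
As an aside, /dev/console ends up referenced by descriptors 0, 1 and 2, i.e. it is the standard input, output and error of every process that does not reopen them.
At this point the kernel boot process is complete. init then brings up the user-space processes according to the configuration in the root file system, and the system starts.
VI. Linking
Linking is controlled by arch/i386/kernel/vmlinux.lds.S and arch/i386/boot/setup.ld.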
setup.ld
/*
* setup.ld
*
* Linker script for the i386 setup code
*/
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)
SECTIONS
{
. = 0;
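.bstext : { *(.bstext) } /* code from header.S */
.bsdata : { *(.bsdata) }
. = 497;
.header : { *(.header) } /* its first part fills the first sector up to exactly 512 bytes */
.inittext : { *(.inittext) } /* code from start_of_setup onward in header.S */
.initdata : { *(.initdata) }
.text : { *(.text*) } /* the C code in boot/ */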
. = ALIGN(16);
.rodata : { *(.rodata*) }
.videocards : {
video_cards = .;
*(.videocards)
video_cards_end = .;
}
. = ALIGN(16);
.data : { *(.data*) }
.signature : {
setup_sig = .; /* remember the cmpl instruction? */
LONG(0x5a5aaa55)
}
. = ALIGN(16);
.bss :
{
__bss_start = .;
*(.bss)
__bss_end = .;
} /* the bss-clearing code in setup uses __bss_start, together with _end below */
. = ALIGN(16);
_end = .;
/DISCARD/ : { *(.note*) }
. = ASSERT(_end
. = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");
/* the first ASSERT keeps setup from growing too large: it must stay within 32K */
}
vmlinux.lds.S is not reproduced in full. As an example, here is how the initcall section mentioned above is laid out:
.initcall.init : AT(ADDR(.initcall.init) - LOAD_OFFSET) {
__initcall_start = .;
INITCALLS
__initcall_end = .;
}
#define INITCALLS \
a lot here \
*(.initcallrootfs.init) \
a lot here
The kernel has many other special sections as well, for example
#define __EXPORT_SYMBOL(sym, sec) \
extern typeof(sym) sym; \
__CRC_SYMBOL(sym, sec) \
static const char __kstrtab_##sym[] \
__attribute__((section("__ksymtab_strings"))) \
= MODULE_SYMBOL_PREFIX #sym; \
static const struct kernel_symbol __ksymtab_##sym \
__attribute_used__ \
__attribute__((section("__ksymtab" sec), unused)) \
= { (unsigned long)&sym, __kstrtab_##sym }
#define EXPORT_SYMBOL(sym) \
__EXPORT_SYMBOL(sym, "")
#define EXPORT_SYMBOL_GPL(sym) \
__EXPORT_SYMBOL(sym, "_gpl")
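These are used to export symbols. Unlike an ordinary executable produced by ld, the kernel has to resolve symbol references for other files (modules) when they are loaded, so it needs this information. In the example above, _GPL symbols go into a separate section, so when the symbols of a module are being resolved and the module is not released under the GPL, the __ksymtab_gpl section is not searched.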
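The effect of keeping GPL-only symbols in a separate table can be modelled in a few lines of C. This is a deliberately simplified sketch and not the kernel's module loader: the two arrays merely stand in for __ksymtab and __ksymtab_gpl, and the symbol names, addresses and function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sym {
    const char *name;
    unsigned long value;
};

/* Two separate tables, standing in for __ksymtab and __ksymtab_gpl. */
static const struct sym plain_syms[] = { { "some_export", 0x1000 } };
static const struct sym gpl_syms[]   = { { "gpl_only_helper", 0x2000 } };

static const struct sym *find_in(const struct sym *tab, size_t n,
                                 const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return &tab[i];
    return NULL;
}

/* Resolve a name on behalf of a module; the GPL table is consulted
 * only when the module itself is GPL-licensed. */
const struct sym *resolve(const char *name, int module_is_gpl)
{
    assert(name != NULL);
    const struct sym *s =
        find_in(plain_syms, sizeof plain_syms / sizeof plain_syms[0], name);
    if (s == NULL && module_is_gpl)
        s = find_in(gpl_syms, sizeof gpl_syms / sizeof gpl_syms[0], name);
    return s;
}
```

Here resolve("gpl_only_helper", 0) returns NULL while resolve("gpl_only_helper", 1) finds the entry, which is the behaviour the separate section makes possible.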
VII. Appendix
References:
linux/Documentation/i386/boot.txt
linux/Documentation/initrd.txt
linux/Documentation/kbuild/makefile.txt
ld.pdf gcc.pdf
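ULK3 chapters 2, 3, 4, 9, 12 and appendices A, B
The Linux source code; a few dozen pages of the Intel developer's manual, Volume 3A
Summary: this analysis is somewhat rough, and my own knowledge is limited.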