------------------------------------------
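I had long wanted to study the Linux boot process in detail, but project deadlines kept getting in the way. Only last week did I find some spare time to work through the overall flow. I summarize it here in the hope that it helps newcomers get started; corrections are welcome.
1: Bootloader loading stage
In embedded systems, the basic environment initialization is usually done by the bootloader. Once it has initialized the basic hardware, the bootloader loads the kernel image into a region of memory. On x86, the environment initialization after power-on is done by services provided by the BIOS, which then jumps to the boot program of the active partition.
Where the kernel image is loaded matters, and to see why we have to start with how the image is put together: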
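A Linux system image consists of a boot (setup) layer followed by the kernel code image. Look at what make bzImage does: the build tool generated from linux-2.6.25/arch/x86/boot/tools/build.c joins the setup file built from linux-2.6.25/arch/x86/boot/head.S with the kernel image (normally a compressed one).
Because of this layout, boot has to jump from the head.S part into the kernel code, so the kernel code must sit at a fixed address. For a small kernel (zImage), the protected-mode code is loaded at 0x10000 and later run from 0x1000. For a big kernel (bzImage), it is loaded at 0x100000.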
The flow described above is shown in the following figure:
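In addition, the file built from head.S has its own structure: it contains a built-in boot program and a block of initialization code. Now that bootloaders are universal, Linux's built-in boot program serves no purpose, and the kernel developers no longer want to maintain it. So if you try to boot with it, it simply prints an error message on the screen. The boot program occupies the first 512 bytes of the file built from head.S (not 512 KB), and the bootloader jumps to offset 512 from the load address to start execution. The linker script used for head.S is as follows: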
linux-2.6.25/arch/x86/boot/setup.ld
1 /*
2 * setup.ld
3 *
4 * Linker script for the i386 setup code
5 */
6 OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
7 OUTPUT_ARCH(i386)
8 ENTRY(_start)
9
10 SECTIONS
11 {
12 . = 0;
13 .bstext : { *(.bstext) }
14 .bsdata : { *(.bsdata) }
15
16 . = 497;
17 .header : { *(.header) }
18 .inittext : { *(.inittext) }
19 .initdata : { *(.initdata) }
20 .text : { *(.text*) }
21
22 . = ALIGN(16);
23 .rodata : { *(.rodata*) }
24
25 .videocards : {
26 video_cards = .;
27 *(.videocards)
28 video_cards_end = .;
29 }
30
31 . = ALIGN(16);
32 .data : { *(.data*) }
33
34 .signature : {
35 setup_sig = .;
36 LONG(0x5a5aaa55)
37 }
38
39
40 . = ALIGN(16);
41 .bss :
42 {
43 __bss_start = .;
44 *(.bss)
45 __bss_end = .;
46 }
47 . = ALIGN(16);
48 _end = .;
49
50 /DISCARD/ : { *(.note*) }
51
52 . = ASSERT(_end <= 0x8000, "Setup too big!");
53 . = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");
54 }
Note lines 16~17, which place the header at offset 497.
Part of head.S is shown below:
linux-2.6.25/arch/x86/boot/head.S
……
.section ".header", "a"
.globl hdr
# The code below initializes the members of hdr. Execution leaves _start through a hand-encoded jump instruction.
hdr:
setup_sects: .byte SETUPSECTS
root_flags: .word ROOT_RDONLY
syssize: .long SYSSIZE
ram_size: .word RAMDISK
vid_mode: .word SVGA_MODE
root_dev: .word ROOT_DEV
boot_flag: .word 0xAA55
# offset 512, entry point
# This is byte offset 512, the entry point used after the bootloader has loaded the kernel.
.globl _start
_start:
# Explicitly enter this as bytes, or the assembler
# tries to generate a 3-byte jump here, which causes
# everything else to push off to the wrong offset.
# These bytes are the opcode and displacement of a short jmp.
.byte 0xeb # short (2-byte) jump
.byte start_of_setup-1f
……
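As noted above, .header sits at offset 497. The seven fields listed before _start occupy 15 bytes, so offset 512 lands exactly on _start, which is the entry point the bootloader jumps to. Incidentally, hdr holds the boot parameters and corresponds to a struct setup_header.
Here _start uses a raw jmp opcode to jump to start_of_setup, so that the remaining hdr members, which follow _start, keep their initialized values and fixed offsets instead of being executed as code.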
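The offset arithmetic can be checked with a small C sketch. The struct mirrors only the seven fields visible in the head.S excerpt; the names `boot_hdr_prefix` and `image_offset` are made up for illustration and are not kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the seven hdr fields shown in the excerpt.
 * __attribute__((packed)) is a GCC/Clang extension. */
struct boot_hdr_prefix {
    uint8_t  setup_sects;  /* .byte */
    uint16_t root_flags;   /* .word */
    uint32_t syssize;      /* .long */
    uint16_t ram_size;     /* .word */
    uint16_t vid_mode;     /* .word */
    uint16_t root_dev;     /* .word */
    uint16_t boot_flag;    /* .word, 0xAA55 */
} __attribute__((packed));

#define HEADER_OFFSET 497u  /* ". = 497" in setup.ld */

/* Offset inside the setup image of a position given relative to hdr. */
static size_t image_offset(size_t offset_in_hdr)
{
    assert(offset_in_hdr <= sizeof(struct boot_hdr_prefix));
    return HEADER_OFFSET + offset_in_hdr;
}
```

Passing `sizeof(struct boot_hdr_prefix)` gives the position of the first byte after these fields, which is where `_start` is placed; passing the offset of `boot_flag` gives the position of the 0xAA55 signature.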
Moving on to start_of_setup:
# Entry point after the jump
start_of_setup:
# If SAFE_RESET_DISK_CONTROLLER is defined, reset the disk controller
#ifdef SAFE_RESET_DISK_CONTROLLER
# Reset the disk controller.
movw $0x0000, %ax # Reset disk controller
movb $0x80, %dl # All disks
int $0x13
#endif
# Set the es register to the value of ds
# Force %es = %ds
movw %ds, %ax
movw %ax, %es
cld
# A C function is about to be called, so set up the stack first
# Apparently some ancient versions of LILO invoked the kernel with %ss != %ds,
# which happened to work by accident for the old code. Recalculate the stack
# pointer if %ss is invalid. Otherwise leave it alone, LOADLIN sets up the
# stack behind its own code, so we can't blindly put it directly past the heap.
movw %ss, %dx
cmpw %ax, %dx # %ds == %ss?
movw %sp, %dx
je 2f # -> assume %sp is reasonably set
# Invalid %ss, make up a new stack
movw $_end, %dx
testb $CAN_USE_HEAP, loadflags
jz 1f
movw heap_end_ptr, %dx
1: addw $STACK_SIZE, %dx
jnc 2f
xorw %dx, %dx # Prevent wraparound
2: # Now %dx should point to the end of our stack space
andw $~3, %dx # dword align (might as well...)
jnz 3f
movw $0xfffc, %dx # Make sure we're not zero
3: movw %ax, %ss
movzwl %dx, %esp # Clear upper half of %esp
sti # Now we should have a working stack
# We will have entered with %cs = %ds+0x20, normalize %cs so
# it is on par with the other segments.
pushw %ds
pushw $6f
lretw
6:
# Check whether setup_sig equals 0x5a5aaa55; the linker script places that value at setup_sig
# Check signature at end of setup
cmpl $0x5a5aaa55, setup_sig
jne setup_bad
# Clear the BSS
# Zero the bss
movw $__bss_start, %di
movw $_end+3, %cx
xorl %eax, %eax
subw %di, %cx
shrw $2, %cx
rep; stosl
# Jump to main
# Jump to C code (should not return)
calll main
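With the stack set up, call main transfers control to a function written in C, which initializes part of the hardware environment. Note that up to this point the CPU is still running in real mode.
The code of main() is as follows: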
linux-2.6.25/arch/x86/boot/main.c
void main(void)
{
/* First, copy the boot header into the "zeropage" */
copy_boot_params();
/* End of heap check */
init_heap();
/* Make sure we have all the proper CPU support */
// Check that the CPU is supported
if (validate_cpu()) {
puts("Unable to boot - please use a kernel appropriate "
"for your CPU.\n");
die();
}
/* Tell the BIOS what CPU mode we intend to run in. */
// Tell the BIOS which CPU operating mode will be used
set_bios_mode();
/* Detect memory layout */
// Call int 0x15 to ask the BIOS for the current memory layout
detect_memory();
/* Set keyboard repeat rate (why?) */
keyboard_set_repeat();
/* Query MCA information */
// Check for the IBM Micro Channel bus
query_mca();
/* Voyager */
#ifdef CONFIG_X86_VOYAGER
query_voyager();
#endif
/* Query Intel SpeedStep (IST) information */
query_ist();
/* Query APM information */
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
query_apm_bios();
#endif
/* Query EDD information */
#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
query_edd();
#endif
/* Set the video mode */
set_video();
/* Do the last things and invoke protected mode */
// This call makes the switch to protected mode
go_to_protected_mode();
}
copy_boot_params() copies the contents of hdr into the global variable boot_params, as shown below:
static void copy_boot_params(void)
{
……
memcpy(&boot_params.hdr, &hdr, sizeof hdr);
}
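detect_memory() issues int 0x15 to perform an initial probe of memory and stores the result in boot_params.e820_map.
main() finally calls go_to_protected_mode(). As the name suggests, this function switches the CPU into protected mode.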
void go_to_protected_mode(void)
{
/* Hook before leaving real mode, also disables interrupts */
// Disable interrupts
realmode_switch_hook();
/* Move the kernel/setup to their final resting places */
// Move the kernel to its final location (a zImage is moved from 0x10000 down to 0x1000)
move_kernel_around();
/* Enable the A20 gate */
// Enable the A20 address line (historically gated through the keyboard controller)
if (enable_a20()) {
puts("A20 gate not responding, unable to boot...\n");
die();
}
/* Reset coprocessor (IGNNE#) */
// Reset the coprocessor
reset_coprocessor();
/* Mask all interrupts in the PIC */
// Mask all interrupts in the PIC
mask_all_interrupts();
/* Actual transition to protected mode... */
// Set up a temporary IDT and GDT
// The IDT is left empty
setup_idt();
setup_gdt();
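// Jump to the kernel's 32-bit entry point, i.e. the startup code in head_32.S
// TODO: for a compressed kernel, execution still starts at 0x1000 here; it has not yet jumped to the relocated copy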
protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));
}
First, setup_idt() and setup_gdt() are called to set up a temporary IDT and GDT. The code is as follows:
static void setup_idt(void)
{
static const struct gdt_ptr null_idt = {0, 0};
asm volatile("lidtl %0" : : "m" (null_idt));
}
As you can see, this temporary IDT is empty.
static void setup_gdt(void)
{
/* There are machines which are known to not boot with the GDT
being 8-byte unaligned. Intel recommends 16 byte alignment. */
static const u64 boot_gdt[] __attribute__((aligned(16))) = {
/* CS: code, read/execute, 4 GB, base 0 */
[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
/* DS: data, read/write, 4 GB, base 0 */
[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
/* TSS: 32-bit tss, 104 bytes, base 4096 */
/* We only have a TSS here to keep Intel VT happy;
we don't actually use it for anything. */
[GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
};
/* Xen HVM incorrectly stores a pointer to the gdt_ptr, instead
of the gdt_ptr contents. Thus, make it static so it will
stay in memory, at least long enough that we switch to the
proper kernel GDT. */
static struct gdt_ptr gdt;
gdt.len = sizeof(boot_gdt)-1;
gdt.ptr = (u32)&boot_gdt + (ds() << 4);
asm volatile("lgdtl %0" : : "m" (gdt));
}
Here the GDT is initialized with three entries: GDT_ENTRY_BOOT_CS, GDT_ENTRY_BOOT_DS, and GDT_ENTRY_BOOT_TSS. GDT_ENTRY_BOOT_CS and GDT_ENTRY_BOOT_DS both have base address 0 and a segment limit of 4 GB. GDT_ENTRY_BOOT_TSS is not actually used.
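To see how the three arguments of each entry turn into an 8-byte descriptor, here is a minimal sketch that packs them according to the x86 segment-descriptor layout. `pack_descriptor` and `segment_size` are illustrative helpers written for this article, not the kernel's GDT_ENTRY macro; the flags argument is assumed to carry the access byte in bits 0-7 and the G and D/B flags in bits 12-15, as the constants 0xc09b and 0xc093 suggest.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper: pack flags, base and limit into one x86
 * segment descriptor (8 bytes). */
static uint64_t pack_descriptor(uint16_t flags, uint32_t base, uint32_t limit)
{
    assert(limit <= 0xfffffu);                        /* limit is a 20-bit field */
    return ((uint64_t)(base  & 0xff000000u) << 32) |  /* base 31..24 -> bits 63..56 */
           ((uint64_t)(flags & 0xf0ffu)     << 40) |  /* access byte and G, D/B flags */
           ((uint64_t)(limit & 0x000f0000u) << 32) |  /* limit 19..16 -> bits 51..48 */
           ((uint64_t)(base  & 0x00ffffffu) << 16) |  /* base 23..0 -> bits 39..16 */
            (uint64_t)(limit & 0x0000ffffu);          /* limit 15..0 -> bits 15..0 */
}

/* Bytes covered by a segment: limit + 1 units, where a unit is 4 KiB
 * when the granularity flag (bit 15 of flags) is set, else 1 byte. */
static uint64_t segment_size(uint16_t flags, uint32_t limit)
{
    uint64_t units = (uint64_t)limit + 1;
    return (flags & 0x8000u) ? units << 12 : units;
}
```

With flags 0xc09b and limit 0xfffff the granularity flag is set, so the segment covers 0x100000 units of 4 KiB, which is the 4 GB mentioned above; the TSS entry has the flag clear, so its limit of 103 means 104 bytes.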
The actual switch from real mode to protected mode is done in protected_mode_jump. The code is as follows:
linux-2.6.25/arch/x86/boot/pmjump.S
protected_mode_jump:
# edx holds the second argument, i.e. the address of boot_params
movl %edx, %esi # Pointer to boot_params table
xorl %ebx, %ebx
movw %cs, %bx
shll $4, %ebx
addl %ebx, 2f
# Set cx -> __BOOT_DS, di -> __BOOT_TSS
movw $__BOOT_DS, %cx
movw $__BOOT_TSS, %di
# Set the PE bit in CR0 to enable protected mode
movl %cr0, %edx
orb $X86_CR0_PE, %dl # Protected mode
movl %edx, %cr0
jmp 1f # Short jump to serialize on 386/486
1:
# Transition to 32-bit mode
.byte 0x66, 0xea # ljmpl opcode
2: .long in_pm32 # offset
.word __BOOT_CS # segment
.size protected_mode_jump, .-protected_mode_jump
.code32
.type in_pm32, @function
in_pm32:
# Set up data segments for flat 32-bit mode
# Load the segment registers
movl %ecx, %ds
movl %ecx, %es
movl %ecx, %fs
movl %ecx, %gs
movl %ecx, %ss
# The 32-bit code sets up its own stack, but this way we do have
# a valid stack if some debugging hack wants to use it.
addl %ebx, %esp
# Set up TR to make Intel VT happy
ltr %di
# Clear registers to allow for future extensions to the
# 32-bit boot protocol
# Clear the general-purpose registers
xorl %ecx, %ecx
xorl %edx, %edx
xorl %ebx, %ebx
xorl %ebp, %ebp
xorl %edi, %edi
# Set up LDTR to make Intel VT happy
lldt %cx
# Jump to the given entry point
jmpl *%eax # Jump to the 32-bit entrypoint
First, note that protected_mode_jump takes its arguments in registers: the first argument is in eax and the second in edx.
The two arguments are passed as follows:
protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));
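The first is the address to jump to once in protected mode: 0x1000 for a small kernel (zImage) and 0x100000 for a big kernel (bzImage). The second is the linear address of the boot parameters, computed from the real-mode data segment and the offset of boot_params.
After moving the boot-parameter pointer into esi, the function enables protected mode, loads the segment registers, clears the general-purpose registers, and then jumps to the given address.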
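The expression `(u32)&boot_params + (ds() << 4)` is the usual real-mode address translation: segment times 16 plus offset. A minimal sketch, with a hypothetical helper name and made-up segment values:

```c
#include <assert.h>
#include <stdint.h>

/* Linear address of a real-mode segment:offset pair.
 * real_mode_linear is an illustrative name, not a kernel function. */
static uint32_t real_mode_linear(uint16_t segment, uint32_t offset)
{
    assert(offset <= 0xffffu);  /* real-mode offsets are 16 bits wide */
    return ((uint32_t)segment << 4) + offset;
}
```

The same rule explains the comment in head.S about entering with %cs = %ds+0x20: a segment that is 0x20 higher starts 0x200 (512) bytes later, so cs:0 and ds:0x200 name the same byte, the entry point at offset 512.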
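The jump target is the startup code in arch/x86/kernel/head_32.S (for a compressed image, the decompression stub in arch/x86/boot/compressed/head_32.S runs first).
Before analyzing that code, let us look at its linker script: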
linux-2.6.25/arch/x86/kernel/vmlinux_32.lds.S
#define LOAD_OFFSET __PAGE_OFFSET
#include
#include
#include
#include
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(phys_startup_32)
jiffies = jiffies_64;
PHDRS {
text PT_LOAD FLAGS(5); /* R_E */
data PT_LOAD FLAGS(7); /* RWE */
note PT_NOTE FLAGS(0); /* ___ */
}
SECTIONS
{
. = LOAD_OFFSET + LOAD_PHYSICAL_ADDR;
phys_startup_32 = startup_32 - LOAD_OFFSET;
.text.head : AT(ADDR(.text.head) - LOAD_OFFSET) {
_text = .; /* Text and read-only data */
*(.text.head)
} :text = 0x9090
……
……
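All SECTIONS start at LOAD_OFFSET + LOAD_PHYSICAL_ADDR. LOAD_OFFSET is the familiar PAGE_OFFSET, and LOAD_PHYSICAL_ADDR is 0x100000 in the default configuration. This is where the relationship between kernel linear addresses and physical addresses comes from: a kernel linear address is the physical address plus PAGE_OFFSET.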
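The linear-to-physical relationship amounts to adding or subtracting a constant. The sketch below assumes the common 3G/1G split (PAGE_OFFSET = 0xC0000000) and the default load address; both values are configurable in a real kernel, and the helper names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_ASSUMED        0xC0000000u  /* assumption: default 3G/1G split */
#define LOAD_PHYSICAL_ADDR_ASSUMED 0x00100000u  /* assumption: default, 1 MB */

/* Kernel linear address -> physical address (the idea behind __pa). */
static uint32_t virt_to_phys_sketch(uint32_t vaddr)
{
    assert(vaddr >= PAGE_OFFSET_ASSUMED);
    return vaddr - PAGE_OFFSET_ASSUMED;
}

/* Physical address -> kernel linear address (the idea behind __va). */
static uint32_t phys_to_virt_sketch(uint32_t paddr)
{
    assert(paddr < 0u - PAGE_OFFSET_ASSUMED);  /* must fit below 4 GB after the shift */
    return paddr + PAGE_OFFSET_ASSUMED;
}
```

Under these assumptions the first section is linked at 0xC0100000 but physically sits at 0x100000, which is why the early code in head_32.S wraps symbols in pa() until paging is on.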
Now for the code in arch/x86/kernel/head_32.S:
Because the code is long, it is analyzed in pieces below, with the conditionally compiled parts omitted.
# Total number of low pages: (1<<32) / (1<<12)
LOW_PAGES = 1<<(32-PAGE_SHIFT_asm)
/*
* To preserve the DMA pool in PAGEALLOC kernels, we'll allocate
* pagetables from above the 16MB DMA limit, so we'll have to set
* up pagetables 16MB more (worst-case):
*/
#ifdef CONFIG_DEBUG_PAGEALLOC
LOW_PAGES = LOW_PAGES + 0x1000000
#endif
#if PTRS_PER_PMD > 1
# Space taken by the page tables and the PMDs
PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PMD) + PTRS_PER_PGD
#else
# Number of pages taken by the page tables (each page table occupies one page)
PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PGD)
#endif
// Size of the bitmap that tracks pages, one bit per page
BOOTBITMAP_SIZE = LOW_PAGES / 8
ALLOCATOR_SLOP = 4
// Total size of the area
INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE + (PAGE_TABLE_SIZE + ALLOCATOR_SLOP)*PAGE_SIZE_asm
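This part computes the space needed for the page bitmap and for the PTE and PMD pages. No space is set aside here for the PGD because the PGD area is allocated when the kernel is linked; it is static storage.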
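As a worked example, the sketch below evaluates these constants for one assumed configuration: 4 KiB pages, no PAE (so PTRS_PER_PGD is 1024 and the `#else` branch applies) and CONFIG_DEBUG_PAGEALLOC off. The function names are invented; only the formulas come from the listing.

```c
#include <assert.h>

#define PAGE_SHIFT_ASSUMED   12u    /* assumption: 4 KiB pages */
#define PTRS_PER_PGD_ASSUMED 1024u  /* assumption: non-PAE */
#define ALLOCATOR_SLOP       4u     /* from the listing */

/* LOW_PAGES = 1 << (32 - PAGE_SHIFT) */
static unsigned long low_pages(void)
{
    return 1ul << (32u - PAGE_SHIFT_ASSUMED);
}

/* PAGE_TABLE_SIZE = LOW_PAGES / PTRS_PER_PGD, counted in pages */
static unsigned long page_table_size(void)
{
    return low_pages() / PTRS_PER_PGD_ASSUMED;
}

/* BOOTBITMAP_SIZE = LOW_PAGES / 8, counted in bytes */
static unsigned long bootbitmap_size(void)
{
    return low_pages() / 8u;
}

/* INIT_MAP_BEYOND_END, counted in bytes */
static unsigned long init_map_beyond_end(void)
{
    unsigned long page_size = 1ul << PAGE_SHIFT_ASSUMED;
    assert(page_table_size() > 0);
    return bootbitmap_size() + (page_table_size() + ALLOCATOR_SLOP) * page_size;
}
```

Under these assumptions the bitmap needs 128 KiB and the page tables need 1024 pages, so the early mapping has to extend a little over 4 MiB past the end of the page tables.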
# Reload the GDT. This is needed because the whole vmlinux is linked at an offset of __PAGE_OFFSET
lgdt pa(boot_gdt_descr)
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
2:
/*
* Clear BSS first so that there are no surprises...
*/
# Clear the BSS
cld
xorl %eax,%eax
movl $pa(__bss_start),%edi
movl $pa(__bss_stop),%ecx
subl %edi,%ecx
shrl $2,%ecx
rep ; stosl
Here the GDT is reloaded and the BSS is cleared.
# esi already holds the address of the boot parameters passed in
movl $pa(boot_params),%edi
movl $(PARAM_SIZE/4),%ecx
cld
rep
// Copy from the address in esi to the address in edi, i.e. into the memory of boot_params
movsl
movl pa(boot_params) + NEW_CL_POINTER,%esi
andl %esi,%esi
jz 1f # No comand line
movl $pa(boot_command_line),%edi
movl $(COMMAND_LINE_SIZE/4),%ecx
rep
movsl
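This saves the boot parameters into boot_params and the command line into boot_command_line.
# PAE not configured
# Byte offset, within the page directory, of the entry for __PAGE_OFFSET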
page_pde_offset = (__PAGE_OFFSET >> 20);
# pg0: temporary page table, mapping the first 4 MB of memory
movl $pa(pg0), %edi
movl $pa(swapper_pg_dir), %edx
movl $PTE_ATTR, %eax
10:
# edi holds the address of pg0; PDE_ATTR(%edi) produces a PDE entry
leal PDE_ATTR(%edi),%ecx /* Create PDE entry */
# Store the PDE as entry 0 of the PGD
movl %ecx,(%edx) /* Store identity PDE entry */
# Store the same PDE at byte offset page_pde_offset of the PGD, i.e. entry 0x300
movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
# edx now points to the next PGD entry
addl $4,%edx
# Next, fill in the entries of pg0
# Set the loop count
movl $1024, %ecx
11:
# Store eax at the address in edi; edi starts at the address of pg0
stosl
addl $0x1000,%eax #eax = eax +0x1000 (0x1000 = 4K)
loop 11b
# After this loop the entries of pg0 are, in order: 0x007, 0x1007, 0x2007 ... 0x3FF007
# Now both the PGD entry for linear address 0 and the PGD entry for __PAGE_OFFSET can address the first 4 MB
/*
* End condition: we must map up to and including INIT_MAP_BEYOND_END
* bytes beyond the end of our own page tables; the +0x007 is
* the attribute bits
*/
# Note that the mapping must extend all the way to INIT_MAP_BEYOND_END
leal (INIT_MAP_BEYOND_END+PTE_ATTR)(%edi),%ebp
# If the mapping does not yet reach INIT_MAP_BEYOND_END, jump back to 10 and map another page table
cmpl %ebp,%eax
jb 10b
# Store the end address of the page tables in init_pg_tables_end
movl %edi,pa(init_pg_tables_end)
/* Do early initialization of the fixmap area */
// Set up the mapping for the fixmap area
movl $pa(swapper_pg_fixmap)+PDE_ATTR,%eax
movl %eax,pa(swapper_pg_dir+0xffc)
This part sets up the initial page mappings.
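The inner loop of the assembly can be restated in C. This is only a sketch: it fills an ordinary array instead of the real pg0, and it assumes PTE_ATTR is 0x007 (present, writable, user), which is what the listed values 0x007, 0x1007, ... imply. The helper names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_ATTR_ASSUMED 0x007u  /* assumption: present | writable | user */
#define PTES_PER_TABLE   1024u   /* 4-byte entries in one 4 KiB page table */

/* Fill one page table so that entry i maps physical page first_pfn + i. */
static void fill_page_table(uint32_t *table, uint32_t first_pfn)
{
    assert(table != NULL);
    for (size_t i = 0; i < PTES_PER_TABLE; i++)
        table[i] = ((first_pfn + (uint32_t)i) << 12) | PTE_ATTR_ASSUMED;
}

/* Index of the page-directory entry that covers a linear address. */
static unsigned pgd_index_sketch(uint32_t vaddr)
{
    return vaddr >> 22;  /* each PGD entry covers 4 MB */
}
```

One filled table covers 4 MB. Hooking the same table into directory entries 0 and 0x300 is what makes the first 4 MB reachable both at linear address 0 and at __PAGE_OFFSET (assuming 0xC0000000).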
// Enable paging
movl $pa(swapper_pg_dir),%eax
movl %eax,%cr3 /* set the page table pointer.. */
movl %cr0,%eax
orl $X86_CR0_PG,%eax
movl %eax,%cr0 /* ..and set paging (PG) bit */
ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */
1:
/* Set up the stack pointer */
// Set up the kernel-mode stack
lss stack_start,%esp
Here stack_start supplies the stack segment and stack pointer; this is the stack of the system's first kernel process.
// Set up the IDT
call setup_idt
……
……
// Jump to start_kernel
jmp start_kernel
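Control then jumps to start_kernel, which completes the first stage of boot.
2: The second boot stage
The second boot stage is the start_kernel() stage, where a more specific and comprehensive system initialization takes place. Here we focus on the initialization of memory management, which is the most important and most intricate part. We pick out the memory-related functions called from start_kernel() and analyze them.
The first function to analyze is setup_arch(), which performs the platform-specific initialization.
The memory-management-related calls in setup_arch() are as follows: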
void __init setup_arch(char **cmdline_p)
{
// Initialize the early ioremap mapping area
early_ioremap_init();
…….
// Sanitize the e820 map and print it
// Adjust the e820 map obtained from the BIOS, then copy it into e820
print_memory_map(memory_setup());
……
// max_pfn: the highest page frame number
find_max_pfn();
……
// Returns the highest page frame number the kernel can map directly
max_low_pfn = setup_memory();
……
paging_init();
zone_sizes_init();
}
setup_arch() -> early_ioremap_init() is as follows:
void __init early_ioremap_init(void)
{
pmd_t *pmd;
if (early_ioremap_debug)
printk(KERN_INFO "early_ioremap_init()\n");
pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
/* Use bm_pte as the fixed page table for the FIX_BTMAP_BEGIN range */
memset(bm_pte, 0, sizeof(bm_pte));
pmd_populate_kernel(&init_mm, pmd, bm_pte);
/*
* The boot-ioremap range spans multiple pmds, for which
* we are not prepared:
*/
if (pmd != early_ioremap_pmd(fix_to_virt(FIX_BTMAP_END))) {
WARN_ON(1);
printk(KERN_WARNING "pmd %p != %p\n",
pmd, early_ioremap_pmd(fix_to_virt(FIX_BTMAP_END)));
printk(KERN_WARNING "fix_to_virt(FIX_BTMAP_BEGIN): %08lx\n",
fix_to_virt(FIX_BTMAP_BEGIN));
printk(KERN_WARNING "fix_to_virt(FIX_BTMAP_END): %08lx\n",
fix_to_virt(FIX_BTMAP_END));
printk(KERN_WARNING "FIX_BTMAP_END: %d\n", FIX_BTMAP_END);
printk(KERN_WARNING "FIX_BTMAP_BEGIN: %d\n",
FIX_BTMAP_BEGIN);
}
}
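This code points the PMD entry that covers the address range starting at FIX_BTMAP_BEGIN at bm_pte, so that range is always mapped through bm_pte.
Attentive readers may have noticed that the PMD-sized range starting at FIX_BTMAP_BEGIN is the linear address range of the boot-time fixed mappings, i.e. the FIX_BTMAP slots of the fixmap area. That is exactly what it is.
Normally this range lies within a single PMD. If it spans more than one PMD, a warning is printed.
setup_arch() -> print_memory_map(memory_setup());
memory_setup(): as described for the first boot stage, the kernel calls int 0x15 to obtain memory information and stores it in boot_params.e820_map. The map supplied by the BIOS is not always correct; for example, some regions may overlap. This function therefore sanitizes the BIOS-supplied information and saves the result in the global variable e820.
e820 is defined as follows: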
struct e820entry {
// Start address of the memory region
__u64 addr; /* start of memory segment */
// Size of the memory region
__u64 size; /* size of memory segment */
// Type of the memory region
__u32 type; /* type of memory segment */
} __attribute__((packed));
struct e820map {
// Number of entries in the memory map
__u32 nr_map;
// Array of memory map entries
struct e820entry map[E820MAX];
};
print_memory_map() prints the contents of e820. This is the e820 map you see in the boot messages. The code is as follows:
void __init print_memory_map(char *who)
{
int i;
for (i = 0; i < e820.nr_map; i++) {
printk(" %s: %016Lx - %016Lx ", who,
e820.map[i].addr,
e820.map[i].addr + e820.map[i].size);
switch (e820.map[i].type) {
case E820_RAM: printk("(usable)\n");
break;
case E820_RESERVED:
printk("(reserved)\n");
break;
case E820_ACPI:
printk("(ACPI data)\n");
break;
case E820_NVS:
printk("(ACPI NVS)\n");
break;
default: printk("type %u\n", e820.map[i].type);
break;
}
}
}
setup_arch() -> find_max_pfn(): find the highest physical page frame number
void __init find_max_pfn(void)
{
int i;
max_pfn = 0;
for (i = 0; i < e820.nr_map; i++) {
unsigned long start, end;
/* RAM? */
if (e820.map[i].type != E820_RAM)
continue;
start = PFN_UP(e820.map[i].addr);
end = PFN_DOWN(e820.map[i].addr + e820.map[i].size);
if (start >= end)
continue;
if (end > max_pfn)
max_pfn = end;
memory_present(0, start, end);
}
}
The function is simple: it searches the e820 map for the highest page frame number of usable memory.
PFN_UP() and PFN_DOWN() are defined as follows:
#define PFN_UP(x) (((x) + PAGE_SIZE-1) >> PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
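The difference between the two: PFN_DOWN() rounds an address down to a page frame number, whereas PFN_UP() rounds it up. Rounding the start of a range up and its end down keeps only the pages that lie entirely inside the range.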
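A small sketch shows the rounding at work on a made-up memory map. `struct ram_range` is a simplified stand-in for struct e820entry, the SKETCH_ macros assume 4 KiB pages, and `max_pfn_of` only reproduces the loop logic of find_max_pfn() without the memory_present() call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1ull << SKETCH_PAGE_SHIFT)
#define PFN_UP(x)   (((x) + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> SKETCH_PAGE_SHIFT)

/* Simplified stand-in for struct e820entry. */
struct ram_range {
    uint64_t addr;
    uint64_t size;
    int is_ram;  /* nonzero for usable RAM */
};

/* Highest page frame number (exclusive) over all usable RAM ranges. */
static uint64_t max_pfn_of(const struct ram_range *map, size_t n)
{
    uint64_t max_pfn = 0;

    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!map[i].is_ram)
            continue;
        uint64_t start = PFN_UP(map[i].addr);
        uint64_t end = PFN_DOWN(map[i].addr + map[i].size);
        if (start >= end)  /* range holds no complete page */
            continue;
        if (end > max_pfn)
            max_pfn = end;
    }
    return max_pfn;
}
```

For instance, a usable range that ends at 0x8000000 (128 MB) yields a maximum page frame number of 0x8000, and a range smaller than one page contributes nothing.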
setup_arch() -> setup_memory():
static unsigned long __init setup_memory(void)
{
/*
* partially used pages are not usable - thus
* we are rounding upwards:
*/
// init_pg_tables_end: end address of the initial page tables
// min_low_pfn: first page frame number available for use
min_low_pfn = PFN_UP(init_pg_tables_end);
// max_low_pfn: highest page frame number the kernel can address directly
max_low_pfn = find_max_low_pfn();
#ifdef CONFIG_HIGHMEM
highstart_pfn = highend_pfn = max_pfn;
if (max_pfn > max_low_pfn) {
highstart_pfn = max_low_pfn;
}
printk(KERN_NOTICE "%ldMB HIGHMEM available.\n",
pages_to_mb(highend_pfn - highstart_pfn));
num_physpages = highend_pfn;
high_memory = (void *) __va(highstart_pfn * PAGE_SIZE - 1) + 1;
#else
num_physpages = max_low_pfn;
high_memory = (void *) __va(max_low_pfn * PAGE_SIZE - 1) + 1;
#endif
#ifdef CONFIG_FLATMEM
max_mapnr = num_physpages;
#endif
printk(KERN_NOTICE "%ldMB LOWMEM available.\n",
pages_to_mb(max_low_pfn));
setup_bootmem_allocator();
return max_low_pfn;
}
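We met init_pg_tables_end when analyzing the first boot stage: it is the end address of the page tables built for the initial mapping. Note the following global variables:
num_physpages: total number of physical pages
high_memory: starting linear address of high memory
min_low_pfn: lowest page frame number available for use
max_low_pfn: highest page frame number of low memory, i.e. the highest page frame the kernel can use directly
The function find_max_low_pfn() finds the highest page frame number that the kernel maps directly. On 32-bit x86 the kernel directly maps 0~896 MB in the default configuration; the remaining kernel address space is used for high-memory and other mappings.
See the following definitions: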
#define MAXMEM_PFN PFN_DOWN(MAXMEM)
#define MAXMEM (-__PAGE_OFFSET-__VMALLOC_RESERVE)
MAXMEM_PFN is the highest physical page frame number the kernel can map directly.
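The 896 MB figure follows directly from the MAXMEM definition once the constants are filled in. The sketch below assumes __PAGE_OFFSET = 0xC0000000 and __VMALLOC_RESERVE = 128 MB, which are the usual defaults but can be changed by configuration or boot parameters; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_ASSUMED     0xC0000000u   /* assumption: 3G/1G split */
#define VMALLOC_RESERVE_ASSUMED (128u << 20)  /* assumption: 128 MB */

/* MAXMEM = -__PAGE_OFFSET - __VMALLOC_RESERVE, in 32-bit unsigned
 * arithmetic, where -x wraps around to 4 GB - x. */
static uint32_t maxmem_bytes(void)
{
    uint32_t kernel_space = 0u - PAGE_OFFSET_ASSUMED;  /* 1 GB of kernel space */
    assert(kernel_space > VMALLOC_RESERVE_ASSUMED);
    return kernel_space - VMALLOC_RESERVE_ASSUMED;
}

/* MAXMEM_PFN = PFN_DOWN(MAXMEM), with 4 KiB pages. */
static uint32_t maxmem_pfn(void)
{
    return maxmem_bytes() >> 12;
}
```

The 1 GB of kernel address space minus the 128 MB kept for vmalloc and similar mappings leaves 896 MB that can be mapped one-to-one.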
Next, setup_memory() initializes the boot-time allocator, which is used only while the kernel is initializing.
See the code of setup_bootmem_allocator():
void __init setup_bootmem_allocator(void)
{
unsigned long bootmap_size;
/*
* Initialize the boot-time allocator (with low memory only):
*/
// Initialize bootmem
bootmap_size = init_bootmem(min_low_pfn, max_low_pfn);
register_bootmem_low_pages(max_low_pfn);
/*
* Reserve the bootmem bitmap itself as well. We do this in two
* steps (first step was init_bootmem()) because this catches
* the (very unlikely) case of us accidentally initializing the
* bootmem allocator with an invalid RAM area.
*/
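// Reserved memory
// Reserve everything from the kernel text up to the end of the bootmem bitmap. Note that the bitmap starts at init_pg_tables_end (rounded up to a page)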
reserve_bootmem(__pa_symbol(_text), (PFN_PHYS(min_low_pfn) +
bootmap_size + PAGE_SIZE-1) - __pa_symbol(_text),
BOOTMEM_DEFAULT);
/*
* reserve physical page 0 - it's a special BIOS page on many boxes,
* enabling clean reboots, SMP operation, laptop functions.
*/
// Reserve the lowest page
reserve_bootmem(0, PAGE_SIZE, BOOTMEM_DEFAULT);
/* reserve EBDA region, it's a 4K region */
// Reserve the EBDA
reserve_ebda_region();
// Reserve a specific region on certain AMD systems
/* could be an AMD 768MPX chipset. Reserve a page before VGA to prevent
PCI prefetch into it (errata #56). Usually the page is reserved anyways,
unless you have no PS/2 mouse plugged in. */
if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
boot_cpu_data.x86 == 6)
reserve_bootmem(0xa0000 - 4096, 4096, BOOTMEM_DEFAULT);
#ifdef CONFIG_SMP
/*
* But first pinch a few for the stack/trampoline stuff
* FIXME: Don't need the extra page at 4K, but need to fix
* trampoline before removing it. (see the GDT stuff)
*/
// For SMP, reserve PAGE_SIZE bytes starting at address 4K
reserve_bootmem(PAGE_SIZE, PAGE_SIZE, BOOTMEM_DEFAULT);
#endif
#ifdef CONFIG_ACPI_SLEEP
/*
* Reserve low memory region for sleep support.
*/
acpi_reserve_bootmem();
#endif
#ifdef CONFIG_X86_FIND_SMP_CONFIG
/*
* Find and reserve possible boot-time SMP configuration:
*/
find_smp_config();
#endif
#ifdef CONFIG_BLK_DEV_INITRD
reserve_initrd();
#endif
numa_kva_reserve();
reserve_crashkernel();
}
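init_bootmem() initializes bootmem.
register_bootmem_low_pages() registers the usable physical memory with bootmem.
reserve_bootmem() marks memory as reserved in bootmem, so that it is never handed out by the allocator.
The code of init_bootmem() is as follows: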
unsigned long __init init_bootmem(unsigned long start, unsigned long pages)
{
max_low_pfn = pages;
min_low_pfn = start;
return init_bootmem_core(NODE_DATA(0), start, 0, pages);
}
NODE_DATA() looks up the node with the given id. If CONFIG_NUMA is not enabled there is only one node, contig_page_data:
#define NODE_DATA(nid) (&contig_page_data)
init_bootmem_core():
static unsigned long __init init_bootmem_core(pg_data_t *pgdat,
unsigned long mapstart, unsigned long start, unsigned long end)
{
bootmem_data_t *bdata = pgdat->bdata;
unsigned long mapsize;
// Address of the allocation bitmap
bdata->node_bootmem_map = phys_to_virt(PFN_PHYS(mapstart));
// Start address of the memory to allocate from
bdata->node_boot_start = PFN_PHYS(start);
// Highest page frame number
bdata->node_low_pfn = end;
// Add bdata to a global list
link_bootmem(bdata);
/*
* Initially all pages are reserved - setup_arch() has to
* register free RAM areas explicitly.
*/
// Size of the bitmap
mapsize = get_mapsize(bdata);
// Set every bit of the allocation bitmap to 1
memset(bdata->node_bootmem_map, 0xff, mapsize);
return mapsize;
}
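Putting these pieces together: during initialization, bootmem places its allocation bitmap at page min_low_pfn. The bitmap bits for all pages from 0 to max_low_pfn are set to 1, meaning that no page is available yet.
Next, register_bootmem_low_pages() is called to register with bootmem the memory it is allowed to hand out. The code is as follows:
// Clear the bits of usable memory in the allocation bitmap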
void __init register_bootmem_low_pages(unsigned long max_low_pfn)
{
int i;
for (i = 0; i < e820.nr_map; i++) {
unsigned long curr_pfn, last_pfn, size;
/*
* Reserve usable low memory
*/
if (e820.map[i].type != E820_RAM)
continue;
/*
* We are rounding up the start address of usable memory:
*/
curr_pfn = PFN_UP(e820.map[i].addr);
if (curr_pfn >= max_low_pfn)
continue;
/*
* ... and at the end of the usable range downwards:
*/
last_pfn = PFN_DOWN(e820.map[i].addr + e820.map[i].size);
if (last_pfn > max_low_pfn)
last_pfn = max_low_pfn;
/*
* .. finally, did all the rounding and playing
* around just make the area go away?
*/
if (last_pfn <= curr_pfn)
continue;
size = last_pfn - curr_pfn;
// Free the range curr_pfn -> last_pfn, i.e. clear the corresponding bits in the allocation bitmap
free_bootmem(PFN_PHYS(curr_pfn), PFN_PHYS(size));
}
}
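This function clears to 0 the bootmem bitmap bits for all usable physical memory in the e820 map, up to max_low_pfn.
At this point all usable low memory is available for bootmem to allocate. Some of it must be held back, though, for example the memory occupied by the kernel image; if that were handed out, the system would certainly crash.
reserve_bootmem() marks pages as reserved in bootmem. In effect it sets the corresponding bits in the bootmem bitmap back to 1.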
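The three steps (mark everything unavailable, free the usable ranges, reserve what must be kept) can be sketched with a tiny bitmap allocator. This is a simplified model, not the kernel's bootmem code: the 64-page size, the struct, and all function names are invented, and allocation is plain first-fit of a single page.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGES 64u  /* illustrative size, not a kernel value */

struct boot_bitmap {
    unsigned char map[SKETCH_PAGES / 8];  /* one bit per page; 1 = not available */
};

/* Like init_bootmem_core(): start with every page marked unavailable. */
static void bitmap_init(struct boot_bitmap *b)
{
    memset(b->map, 0xff, sizeof b->map);
}

/* Like free_bootmem(): clear the bits of a usable range. */
static void bitmap_free_range(struct boot_bitmap *b, unsigned start, unsigned count)
{
    assert(start + count <= SKETCH_PAGES);
    for (unsigned pfn = start; pfn < start + count; pfn++)
        b->map[pfn / 8] &= (unsigned char)~(1u << (pfn % 8));
}

/* Like reserve_bootmem(): set the bits of a range that must be kept. */
static void bitmap_reserve_range(struct boot_bitmap *b, unsigned start, unsigned count)
{
    assert(start + count <= SKETCH_PAGES);
    for (unsigned pfn = start; pfn < start + count; pfn++)
        b->map[pfn / 8] |= (unsigned char)(1u << (pfn % 8));
}

/* First-fit allocation of one page; returns its number or -1 if none is free. */
static int bitmap_alloc_page(struct boot_bitmap *b)
{
    for (unsigned pfn = 0; pfn < SKETCH_PAGES; pfn++) {
        if (!(b->map[pfn / 8] & (1u << (pfn % 8)))) {
            bitmap_reserve_range(b, pfn, 1);
            return (int)pfn;
        }
    }
    return -1;
}
```

Right after bitmap_init() nothing can be allocated; once a range has been freed, allocations skip every page that was reserved afterwards, which is how the kernel image and page 0 stay untouched.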
setup_arch() -> paging_init() initializes paging. The first boot stage already set up a small set of page mappings; here the full initialization is performed.
void __init paging_init(void)
{
#ifdef CONFIG_X86_PAE
set_nx();
if (nx_enabled)
printk(KERN_INFO "NX (Execute Disable) protection: active\n");
#endif
// Initialize the page tables
pagetable_init();
load_cr3(swapper_pg_dir);
__flush_tlb_all();
// Initialize the temporary mapping area
kmap_init();
}
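In pagetable_init(), the mappings for those parts of the 4 GB linear space that have no physical memory behind them, or that the kernel cannot use directly, are cleared. Mappings are then established according to the PAGE_OFFSET offset, and finally page tables are set up for high-memory mappings.
Then load_cr3(swapper_pg_dir) loads swapper_pg_dir into CR3 again so that the updated mappings take effect.
kmap_init() initializes the temporary kernel mapping area: it looks up the first page table entry of that area and stores it in kmap_pte.
setup_arch() -> zone_sizes_init() initializes the zones. The code is as follows: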
void __init zone_sizes_init(void)
{
// Set the end page frame number of each zone
unsigned long max_zone_pfns[MAX_NR_ZONES];
memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
max_zone_pfns[ZONE_DMA] =
virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
#ifdef CONFIG_HIGHMEM
max_zone_pfns[ZONE_HIGHMEM] = highend_pfn;
add_active_range(0, 0, highend_pfn);
#else
add_active_range(0, 0, max_low_pfn);
#endif
free_area_init_nodes(max_zone_pfns);
}
On a single-node (non-NUMA) platform, add_active_range() boils down to:
void __init add_active_range(unsigned int nid, unsigned long start_pfn,
unsigned long end_pfn)
{
……
……
early_node_map[i].nid = nid;
early_node_map[i].start_pfn = start_pfn;
early_node_map[i].end_pfn = end_pfn;
nr_nodemap_entries = i + 1;
}
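That is, one entry is added to early_node_map, with start page frame number 0 and end page frame number equal to the highest physical page frame number.
Next, let us look at how free_area_init_nodes() runs: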
void __init free_area_init_nodes(unsigned long *max_zone_pfn)
{
unsigned long nid;
enum zone_type i;
/* Sort early_node_map as initialisation assumes it is sorted */
sort_node_map();
// The code below computes arch_zone_lowest_possible_pfn[i] ~ arch_zone_highest_possible_pfn[i],
// i.e. the first and last page frame numbers of zone i
memset(arch_zone_lowest_possible_pfn, 0,
sizeof(arch_zone_lowest_possible_pfn));
memset(arch_zone_highest_possible_pfn, 0,
sizeof(arch_zone_highest_possible_pfn));
arch_zone_lowest_possible_pfn[0] = find_min_pfn_with_active_regions();
arch_zone_highest_possible_pfn[0] = max_zone_pfn[0];
for (i = 1; i < MAX_NR_ZONES; i++) {
if (i == ZONE_MOVABLE)
continue;
arch_zone_lowest_possible_pfn[i] =
arch_zone_highest_possible_pfn[i-1];
arch_zone_highest_possible_pfn[i] =
max(max_zone_pfn[i], arch_zone_lowest_possible_pfn[i]);
}
arch_zone_lowest_possible_pfn[ZONE_MOVABLE] = 0;
arch_zone_highest_possible_pfn[ZONE_MOVABLE] = 0;
/* Find the PFNs that ZONE_MOVABLE begins at in each node */
memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
find_zone_movable_pfns_for_nodes(zone_movable_pfn);
/* Print out the zone ranges */
printk("Zone PFN ranges:\n");
for (i = 0; i < MAX_NR_ZONES; i++) {
if (i == ZONE_MOVABLE)
continue;
printk(" %-8s %8lu -> %8lu\n",
zone_names[i],
arch_zone_lowest_possible_pfn[i],
arch_zone_highest_possible_pfn[i]);
}
/* Print out the PFNs ZONE_MOVABLE begins at in each node */
printk("Movable zone start PFN for each node\n");
for (i = 0; i < MAX_NUMNODES; i++) {
if (zone_movable_pfn[i])
printk(" Node %d: %lu\n", i, zone_movable_pfn[i]);
}
/* Print out the early_node_map[] */
printk("early_node_map[%d] active PFN ranges\n", nr_nodemap_entries);
for (i = 0; i < nr_nodemap_entries; i++)
printk(" %3d: %8lu -> %8lu\n", early_node_map[i].nid,
early_node_map[i].start_pfn,
early_node_map[i].end_pfn);
/* Initialise every node */
setup_nr_node_ids();
for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
free_area_init_node(nid, pgdat, NULL,
find_min_pfn_for_node(nid), NULL);
/* Any memory on that node */
if (pgdat->node_present_pages)
node_set_state(nid, N_HIGH_MEMORY);
check_for_regular_memory(pgdat);
}
}
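The zone-boundary loop at the top of free_area_init_nodes() can be isolated in a short sketch. It assumes just three zones (DMA, NORMAL, HIGHMEM) and leaves out ZONE_MOVABLE; `zone_ranges` and the sample numbers in the note are illustrative, not taken from a real machine.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_ZONES 3  /* assumption: DMA, NORMAL, HIGHMEM only */

/* Derive [lowest, highest) page frame ranges per zone from the
 * per-zone end limits, the way the loop above does. */
static void zone_ranges(const unsigned long max_zone_pfn[SKETCH_NR_ZONES],
                        unsigned long min_pfn,
                        unsigned long lowest[SKETCH_NR_ZONES],
                        unsigned long highest[SKETCH_NR_ZONES])
{
    assert(max_zone_pfn != NULL && lowest != NULL && highest != NULL);
    lowest[0] = min_pfn;
    highest[0] = max_zone_pfn[0];
    for (int i = 1; i < SKETCH_NR_ZONES; i++) {
        /* each zone starts where the previous one ends */
        lowest[i] = highest[i - 1];
        /* a zone with no pages collapses to an empty range */
        highest[i] = max_zone_pfn[i] > lowest[i] ? max_zone_pfn[i] : lowest[i];
    }
}
```

For a hypothetical 1 GB machine with limits {4096, 229376, 262144}, this yields DMA 0-4096, NORMAL 4096-229376 and HIGHMEM 229376-262144. If the HIGHMEM limit is 0, that zone comes out empty instead of having a negative size.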
free_area_init_node() is a fairly involved function. Its code is as follows:
void __paginginit free_area_init_node(int nid, struct pglist_data *pgdat,
unsigned long *zones_size, unsigned long node_start_pfn,
unsigned long *zholes_size)
{
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
// Compute the total (spanned) and present page counts of pgdat
calculate_node_totalpages(pgdat, zones_size, zholes_size);
alloc_node_mem_map(pgdat);
free_area_init_core(pgdat, zones_size, zholes_size);
}
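calculate_node_totalpages() computes the number of pages the node spans (holes included) and the number of pages actually present, stored in pgdat->node_spanned_pages and pgdat->node_present_pages respectively.
alloc_node_mem_map(pgdat) allocates the page descriptors for the pages in the node.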
static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
{
/* Skip empty nodes */
if (!pgdat->node_spanned_pages)
return;
#ifdef CONFIG_FLAT_NODE_MEM_MAP
/* ia64 gets its own node_mem_map, before this, without bootmem */
if (!pgdat->node_mem_map) {
unsigned long size, start, end;
struct page *map;
/*
* The zone's endpoints aren't required to be MAX_ORDER
* aligned but the node_mem_map endpoints must be in order
* for the buddy allocator to function correctly.
*/
start = pgdat->node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1);
end = pgdat->node_start_pfn + pgdat->node_spanned_pages;
end = ALIGN(end, MAX_ORDER_NR_PAGES);
size = (end - start) * sizeof(struct page);
map = alloc_remap(pgdat->node_id, size);
if (!map)
map = alloc_bootmem_node(pgdat, size);
// First page descriptor of pgdat
pgdat->node_mem_map = map + (pgdat->node_start_pfn - start);
}
#ifndef CONFIG_NEED_MULTIPLE_NODES
/*
* With no DISCONTIG, the global mem_map is just set as node 0's
*/
if (pgdat == NODE_DATA(0)) {
mem_map = NODE_DATA(0)->node_mem_map;
#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
if (page_to_pfn(mem_map) != pgdat->node_start_pfn)
mem_map -= (pgdat->node_start_pfn - ARCH_PFN_OFFSET);
#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
}
#endif
#endif /* CONFIG_FLAT_NODE_MEM_MAP */
}
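It allocates an array of page structures sized for the number of pages the node spans. For the first node, the array is also assigned to mem_map; this is where mem_map comes from.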
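The size of that array is not simply the spanned page count, because both ends are first aligned to MAX_ORDER_NR_PAGES. The sketch below redoes the alignment arithmetic; it assumes MAX_ORDER_NR_PAGES is 1024 (MAX_ORDER of 11), and the function names are made up.

```c
#include <assert.h>

#define MAX_ORDER_NR_PAGES_ASSUMED 1024ul  /* assumption: 1 << (11 - 1) */

/* Number of struct page entries allocated for a node. */
static unsigned long mem_map_entries(unsigned long node_start_pfn,
                                     unsigned long spanned_pages)
{
    const unsigned long n = MAX_ORDER_NR_PAGES_ASSUMED;
    unsigned long start = node_start_pfn & ~(n - 1);      /* round start down */
    unsigned long end = node_start_pfn + spanned_pages;

    assert(end >= node_start_pfn);                        /* no wraparound */
    end = (end + n - 1) & ~(n - 1);                       /* round end up */
    return end - start;
}

/* Index inside the array of the node's first real page,
 * i.e. node_start_pfn - start in the kernel code. */
static unsigned long first_entry_index(unsigned long node_start_pfn)
{
    return node_start_pfn & (MAX_ORDER_NR_PAGES_ASSUMED - 1);
}
```

A node that starts at page 0 and spans an aligned number of pages gets exactly one entry per page; an unaligned node gets a few padding entries at either end, and node_mem_map points past the leading padding.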
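Then free_area_init_core() is called for further initialization. It performs a series of initializations on the zones of the node; let us focus on how the page structure of each page is initialized.
free_area_init_core() -> init_currently_empty_zone() -> memmap_init():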
#define memmap_init(size, nid, zone, start_pfn) \
memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY)
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn, enum memmap_context context)
{
struct page *page;
unsigned long end_pfn = start_pfn + size;
unsigned long pfn;
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
/*
* There can be holes in boot-time mem_map[]s
* handed to this function. They do not
* exist on hotplugged memory.
*/
if (context == MEMMAP_EARLY) {
if (!early_pfn_valid(pfn))
continue;
if (!early_pfn_in_nid(pfn, nid))
continue;
}
page = pfn_to_page(pfn);
set_page_links(page, zone, nid, pfn);
init_page_count(page);
reset_page_mapcount(page);
SetPageReserved(page);
/*
* Mark the block movable so that blocks are reserved for
* movable at startup. This will force kernel allocations
* to reserve their blocks rather than leaking throughout
* the address space during boot when many long-lived
* kernel allocations are made. Later some blocks near
* the start are marked MIGRATE_RESERVE by
* setup_zone_migrate_reserve()
*/
if ((pfn & (pageblock_nr_pages-1)))
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
if (!is_highmem_idx(zone))
set_page_address(page, __va(pfn << PAGE_SHIFT));
#endif
}
}
As we can see, every page goes through the following initialization:
page = pfn_to_page(pfn);
set_page_links(page, zone, nid, pfn);
init_page_count(page);
reset_page_mapcount(page);
SetPageReserved(page);
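pfn_to_page() converts a page frame number to its page structure.
set_page_links() records the node, zone, and section that the page belongs to.
init_page_count()/reset_page_mapcount() initialize the page's reference counts.
SetPageReserved() marks the page as reserved.
Incidentally, init_currently_empty_zone() also calls zone_init_free_lists() to initialize the zone's free_area.
At this point the page structures have been set up and each zone's buddy system has been initialized. However, all pages are still marked reserved, and the free_area lists of the buddy system hold no pages yet. Let us continue with the rest of the initialization.
After setup_arch()'s hard work, memory management has taken shape, but it is not finished; more important steps follow. Continue with start_kernel() -> build_all_zonelists():
This function initializes each node's zonelists. As discussed in the analysis of the buddy system, zones are tried in a fixed order. For example, when memory is requested from ZONE_HIGHMEM and ZONE_HIGHMEM has no free memory, the allocator falls back to ZONE_NORMAL; if that has none either, it falls back to ZONE_DMA. This order is controlled by the zonelist.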
void build_all_zonelists(void)
{
……
__build_all_zonelists(NULL);
……
}
Following the call chain further:
__build_all_zonelists() -> build_zonelists():
static void build_zonelists(pg_data_t *pgdat)
{
……
for (i = 0; i < MAX_NR_ZONES; i++) {
struct zonelist *zonelist;
zonelist = pgdat->node_zonelists + i;
j = build_zonelists_node(pgdat, zonelist, 0, i);
……
}
……
}
build_zonelists_node():
static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
int nr_zones, enum zone_type zone_type)
{
struct zone *zone;
BUG_ON(zone_type >= MAX_NR_ZONES);
zone_type++;
do {
zone_type--;
zone = pgdat->node_zones + zone_type;
if (populated_zone(zone)) {
zonelist->zones[nr_zones++] = zone;
check_highest_zone(zone_type);
}
} while (zone_type);
return nr_zones;
}
As shown above, a zonelist is built for each zone type in the node. The zonelist gives the order in which zones are tried when pages are allocated.
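The descending walk in build_zonelists_node() can be modelled in a few lines. The enum and function below are illustrative stand-ins for the kernel's zone types and zonelist; they assume a single node with three zones and use a plain flag array in place of populated_zone().

```c
#include <assert.h>
#include <stddef.h>

enum zone_sketch { Z_DMA, Z_NORMAL, Z_HIGHMEM, Z_COUNT };

/* Fill 'order' with the zones to try for a request aimed at 'requested',
 * highest zone first, skipping zones without pages.
 * Returns the number of entries written. */
static int build_fallback(const int populated[Z_COUNT],
                          enum zone_sketch requested,
                          enum zone_sketch order[Z_COUNT])
{
    int n = 0;

    assert(populated != NULL && order != NULL);
    assert(requested < Z_COUNT);
    for (int z = (int)requested; z >= 0; z--) {
        if (populated[z])
            order[n++] = (enum zone_sketch)z;
    }
    return n;
}
```

A HIGHMEM request therefore gets the list HIGHMEM, NORMAL, DMA, while a DMA request can only ever be served from DMA. On a machine without high memory, the HIGHMEM list simply starts at NORMAL.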
The buddy system is now fully initialized and is only waiting for free pages to be put into it. That happens in mem_init():
void __init mem_init(void)
{
……
totalram_pages += free_all_bootmem();
……
set_highmem_pages_init(bad_ppro);
……
}
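free_all_bootmem(): releases all free pages in bootmem to the buddy system, and also frees the memory occupied by the bootmem bitmap.
set_highmem_pages_init(bad_ppro): releases the high-memory pages to the buddy system. This is a separate step because the pages managed by bootmem are all pages that the kernel can address directly.
At this point bootmem has served its purpose and must not be used any more; the buddy system is now usable. The last step of memory initialization is to initialize the slab allocator.
This is done in kmem_cache_init(), which initializes cache_cache and several general-purpose caches.
OK. With that, memory initialization is complete. ^_^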