2013-01-07 17:56:43
Original post: Linux Kernel Memory Initialization (Linux内核内存初始化). Author: wangjianchangdx
wjcdx@qq.com
@For study and discussion only; not for commercial use
Linux Kernel Code: 2.6.35.7
ULK3: A.1. Prehistoric Age: the BIOS
The BIOS uses Real Mode addresses because they are the only ones available when the computer is turned on. A Real Mode address is composed of a seg segment and an off offset; the corresponding physical address is given by seg*16+off. As a result, no Global Descriptor Table, Local Descriptor Table, or paging table is needed by the CPU addressing circuit to translate a logical address into a physical one. Clearly, the code that initializes the GDT, LDT, and paging tables must run in Real Mode.
Intel Manual 3a: 3.1 MEMORY MANAGEMENT OVERVIEW
When operating in protected mode, some form of segmentation must be used. There is no mode bit to disable segmentation. The use of paging, however, is optional.
In protected mode, the IA-32 architecture provides a normal physical address space of 4 GBytes (2^32 bytes). This is the address space that the processor can address on its address bus.
The CPU's addressable space is determined by the width of its address bus.
At the system-architecture level in protected mode, the processor uses two stages of address translation to arrive at a physical address: logical-address translation and linear address space paging.
Even with the minimum use of segments, every byte in the processor’s address space is accessed with a logical address.
Every address starts out as a logical address: the operand addresses used by instructions are all logical addresses.
Software enables paging by using the MOV to CR0 instruction to set CR0.PG. Before doing so, software should ensure that control register CR3 contains the physical address of the first paging structure that the processor will use for linear-address translation (see Section 4.2) and that structure is initialized as desired.
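How do we tell whether a page-directory entry holds the physical address of a page or points to the next-level paging structure?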
64-ia-32-architectures-software-developer-vol-3a-3b-system-programming-manual
4.2 HIERARCHICAL PAGING STRUCTURES: AN OVERVIEW
In the examples above, a paging-structure entry maps a page with 4-KByte page frame when only 12 bits remain in the linear address; entries identified earlier always reference other paging structures. That may not apply in other cases. The following items identify when an entry maps a page and when it references another paging structure:
If a paging-structure entry maps a page when more than 12 bits remain in the linear address, the entry identifies a page frame larger than 4 KBytes. For example, 32-bit paging uses the upper 10 bits of a linear address to locate the first paging-structure entry; 22 bits remain. If that entry maps a page, the page frame is 2^22 Bytes = 4 MBytes. 32-bit paging supports 4-MByte pages if CR4.PSE = 1. PAE paging and IA-32e paging support 2-MByte pages (regardless of the value of CR4.PSE). IA-32e paging may support 1-GByte pages (see Section 4.1.4).
cache & tlb
general hardware cache:
tlb cache:
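Page table initialization:
As the figure above shows, the address of the PGD is stored in cr3, the PGD holds the base addresses of the PUDs, and so on down the hierarchy; the tables at each PxD level can therefore be placed fairly flexibly in memory.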
PAGE_OFFSET
arch/x86/include/asm/page_32_types.h:
#define __PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
The PAGE_OFFSET macro yields the value 0xc0000000; this is the offset in the linear address space of a process where the kernel lives.(ULK3)
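Addendum: temporary page table initialization in head_32.S: #else /* Not PAE */
Second stage of page table initialization: setup_arch(&command_line); /* the second stage of page table initialization happens in here */
For now, we only study the page table initialization part.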
x86_init
defined in arch/x86/kernel/x86_init.c:
struct x86_init_ops x86_init __initdata = {
E820
setup_arch()->setup_memory_map()->x86_init.resources.memory_setup(); e820_print_map(who);
http://blog.chinaunix.net/space.php?uid=1701789&do=blog&id=263951
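e820 determines the state of each memory range; low memory (896M) and high memory are mapped; then come the zone-related operations. It is somewhat tedious but easy to follow.
buddy system and slab initialization;
Low-memory page table initialization
/*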
setup_arch()->init_memory_mapping()->kernel_physical_mapping_init()
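cr3 still points to swapper_pg_dir; pmd and pte tables that do not exist yet are allocated;
To understand which memory the page tables are allocated for, we need to understand:
early_ioremap_init(): initializes the fixmap region used by ioremap, which occupies one pmd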
setup_memory_map(): e820 memory setup, ram size and status.
pfn: page frame number
TODO: how memory pages are allocated early in boot
in setup_arch:
/* How many end-of-memory variables you have, grandma! */
Just how many variables are there that mark the end of memory?!
find_low_pfn_range
/* max_pfn_mapped is updated here */
max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
max_pfn_mapped = max_low_pfn_mapped;
max_low_pfn is the highest page frame number of low memory.
#ifdef CONFIG_X86_32
/* max_low_pfn get updated here */
find_low_pfn_range();
#else
||
\/
void __init find_low_pfn_range(void)
{
/* it could update max_pfn */
if (max_pfn <= MAXMEM_PFN)
lowmem_pfn_init();
else
highmem_pfn_init();
}
||
\/
/*
* All of RAM fits into lowmem - but if user wants highmem
* artificially via the highmem=x boot parameter then create
* it:
*/
void __init lowmem_pfn_init(void)
==
/*
* We have more RAM than fits into lowmem - we try to put it into
* highmem, also taking the highmem=x boot parameter into account:
*/
void __init highmem_pfn_init(void)
The comments also mention the case where a highmem= parameter is given on the boot command line; that case is ignored here;
init_memory_mapping
max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
||
\/
/*
 * Setup the direct mapping of the physical memory at PAGE_OFFSET.
 * This runs before bootmem is initialized and gets pages directly from
 * the physical memory. To access them they are temporarily mapped.
 */
unsigned long __init_refok init_memory_mapping(unsigned long start,
unsigned long end)
From this we can see:
init_memory_mapping uses struct map_range:
struct map_range mr[NR_RANGE_MR];
The difference between map_range here and e820.map is:
in e820.map both start_addr and size are 64-bit, so it can describe memory ranges beyond 4G;
What to study next:
/* head if not big page alignment ? */
start_pfn = start >> PAGE_SHIFT;
pos = start_pfn << PAGE_SHIFT;
#ifdef CONFIG_X86_32
/*
 * Don't use a large page for the first 2/4MB of memory
 * because there are often fixed size MTRRs in there
 * and overlapping MTRRs into large pages can cause
 * slowdowns.
 */
if (pos == 0)
end_pfn = 1<<(PMD_SHIFT - PAGE_SHIFT);
else
end_pfn = ((pos + (PMD_SIZE - 1))>>PMD_SHIFT)
<< (PMD_SHIFT - PAGE_SHIFT);
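The expressions above count page frames starting from 1: counted from 1, end_pfn is the number of the last page frame of the range; counted from 0, it is the number of the first page frame after the range.
More precisely, "1" and "((pos + (PMD_SIZE - 1))>>PMD_SHIFT)" give the number of the PMD that contains "pos", counted from 1.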
struct map_range
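By the time we reach /* big page (2M) range */, that is, the second mr computation, pos (pos = end_pfn << PAGE_SHIFT;) already points at the start address of the second (next) PMD.
The three map_range computations thus split memory at PMD-aligned addresses: the head before the first PMD-aligned address and the tail after the last PMD-aligned address are managed with small pages, while the whole number of PMDs in the middle is managed with large pages;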
page_size_mask: 0 : 4K
1<
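save_mr is simple: it stores a map_range in mr. Note how it is called for each of the three segments:
Next, try to merge same page size and continuous, then print mr;
Find space for the kernel direct mapping tables. This brings us to the early memory allocation mechanism.
The values of pmd, pud, pte, pgd and related variables do not matter much yet, so they can be left for now; in the end, though, the definitions of the related macros must be understood;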
#define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y))
e820_table_start:
e820_table_end:
e820_table_top:
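What to study next:
Confusion about max_pfn_mapped
Mistaken view: with CONFIG_X86_32, max_pfn_mapped is not set until after max_low_pfn_mapped has been set
init_memory_mapping initializes the page tables of the direct mapping area, i.e. low memory. It is called as follows:
max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
Its prototype is:
unsigned long __init_refok init_memory_mapping(unsigned long start,max_low_pfn<
It searches e820.map for a memory range that can hold all the page tables; the end passed in at this point is also max_low_pfn<<PAGE_SHIFT
As the previous comment said, reading setup_arch suggested that max_low_pfn is only set after init_memory_mapping returns, so its value would be 0 here; how could 0 serve as the search bound?
A search shows that max_low_pfn is set while the temporary page tables are built in head_32.S:
shrl $12, %eax
All right, I was wrong: what is passed in here is max_low_pfn, while what head_32.S updates when building the temporary page tables is max_pfn_mapped. Still, one conclusion can be drawn:
max_pfn_mapped is updated passively; its value is the highest page frame number currently mapped.
PS: max_low_pfn is updated earlier, in find_low_pfn_range();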
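Allocating space for the page tables
find_early_table_space first computes the number of memory pages needed for the page tables of each level, then calls
e820_table_start = find_e820_area(start, max_pfn_mapped<<PAGE_SHIFT,
to search e820.map for a memory range that can hold them. The page tables must also not extend beyond the highest address mapped so far, i.e.
e820.map.start + sizeof(page tables of all levels) < max_pfn_mapped<<PAGE_SHIFT
As the previous comment said, max_pfn_mapped is set while the temporary page tables are built and records the highest page frame number currently mapped;
So how is it guaranteed that the region below max_pfn_mapped can hold the page tables of all levels?
Recall that an extra RESERVED region was mapped when the temporary page tables were built:
/*
MAPPING_BEYOND_END, judging by its name, has to do with mapping; going to its definition:
/* Enough space to fit pagetables for the low memory linear map */
The comment already says it: it is large enough to hold all the page tables;
2^32 is 4G and PAGE_OFFSET is 3G, so the definition translates to the size of the page tables for the 1G kernel space (0xc0000000~0xffffffff);
Now look at the definition of PAGE_TABLE_SIZE:
#if PTRS_PER_PMD > 1
Take the second case: PTRS_PER_PMD is not greater than 1, so it equals 1, which means there are only two levels, PGD and PTE; the PGD index takes 10 bits, the PTE index 10 bits, and the offset 12 bits;
numberof(pages)/PTRS_PER_PGD is only the memory occupied by the page tables (PTE); is the PGD not counted? Is that too risky?
It is not: the PGD is stored in the array that swapper_pg_dir points to, which is already statically allocated.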
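It is unclear why PTRS_PER_PTE is not used here, and the PTRS_PER_PMD > 1 case, i.e. three levels PGD/PMD/PTE, is also not understood for now;
The following comment confirms that no space is allocated for the PGD:
1024 should be enough; the pgd is still swapper_pg_dir, and there are no pmds.
find_early_table_space finds enough space for the page tables and initializes the variables that describe this page table memory: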
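Linux page directory structure
Next comes kernel_physical_mapping_init, which performs the actual page table mapping. Before reading it, we need to look at how Linux defines the page table levels.
Linux defines two level layouts for x86:
The PxD_SHIFT and PTRS_PER_PxD macros actually in use are defined in pgtable-2level_types.h and pgtable-3level_types.h: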
/*
* traditional i386 two-level paging structure:
*/
#define PGDIR_SHIFT 22
#define PTRS_PER_PGD 1024
/*
* the i386 is two-level, so we don't really have any
* PMD directory physically.
*/
#define PTRS_PER_PTE 1024
The unused levels are defined in include/asm-generic/pgtable-nop{u,m}d.h:
#define PMD_SHIFT PUD_SHIFT
#define PTRS_PER_PMD 1
#define PMD_SIZE (1UL << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE-1))
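Here we study the most common case:
alloc & init PGD/PMD/PTE
kernel_physical_mapping_init loops over the range and maps the page tables at each level.
It calls one_md_table_init to allocate the memory page that holds a PMD directory, and one_page_table_init to allocate the memory page that holds a PTE table. The pages come from the memory region just found above, and allocation is done one page at a time.
pte = one_page_table_init(pmd);
/*
Here, the way e820_table_end is used shows what e820_table_end and its companion variables mean.
Page directory entry attributes
__pmd(__pa(page_table) | _PAGE_TABLE)
The attributes of the pmd entry are set as well: _PAGE_TABLE
Are the temporary page tables discarded? How are the temporarily mapped page tables handled here?
The original page tables remain in use;
one_page_table_init first checks whether the page table already exists: if (!(pmd_val(*pmd) & _PAGE_PRESENT)) {;
A page table set up by the temporary mapping already exists, so the function simply returns the address of the page table stored there: return pte_offset_kernel(pmd, 0);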
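High-memory page table initialization
As the figure above shows, the 3G~4G kernel space is divided into the direct mapping area, the dynamic mapping area (vmalloc), the permanent mapping area (kmap), and the fixed mapping area (fixed mapping). The direct mapping area, i.e. the mapping of low memory, has been covered above.
Next we look at how high memory is mapped.
Fixed mapping area
First, trace init_memory_mapping: after it finishes initializing lowmem, it calls early_ioremap_page_table_range_init() to initialize the fixed mapping area.
void __init early_ioremap_page_table_range_init(void)
From the above we can see:
page_table_range_init(vaddr, end, pgd_base);
vaddr is the lowest address of the fixed addr area and end is the highest; page_table_range_init initializes the page tables for this area: the pgd is linked down to the page tables, but the entries in the page tables are not yet linked to actual memory pages.
end = pmd_number_of(FIXADDR_TOP) + 1
(FIXADDR_TOP + PMD_SIZE - 1) & PMD_MASK is a round-up: it rounds FIXADDR_TOP up to the next multiple of PMD_SIZE.
where PMD_SIZE = 1 << PMD_SHIFT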
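Back in setup_arch, the next call is initmem_init(0, max_pfn, acpi, k8); which initializes the bootmem allocator. Its logic is somewhat complex and is not studied for now.
After that comes the high-memory initialization part
x86_init.paging.pagetable_setup_start(swapper_pg_dir);
native_pagetable_setup_done is empty; native_pagetable_setup_start clears the page table (pte) mappings of high memory.
Next comes paging_init(), which does not live up to its name
void __init paging_init(void)
Permanent mapping area
static void __init pagetable_init(void)
Sets up the page tables for the 4M permanent mapping area;
static void __init kmap_init(void)
The fixed mappings between FIX_KMAP_BEGIN and FIX_KMAP_END are per-cpu; what this means will only become clear when we study how the fixed mapping area is used.
Page table summary
The study of the kernel address space and the initialization of its page tables is now essentially complete.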
bootmem allocator:
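Reserved memory regions
Where is e820_table_start located?
1. Is it after brk_base, right behind the temporary page tables?
2. How is the memory holding the loaded kernel code protected from being overwritten?
There are two ways to satisfy the first requirement:
To satisfy the second, only the e820.map approach works, and afterwards the other memory allocators must be initialized according to how memory is being used.
e820 is initialized in setup_arch, but tracing the code reveals no place where the kernel image is reserved, and there is very little related material online.
So let's build the kernel and look at the e820.map it actually prints: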
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index cdb4ae9..b278535 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -75,7 +75,6 @@ static void __init find_early_table_space(unsigned long end, int use_pse,
#else
start = 0x8000;
#endif
+ e820_print_map("wjcdx");
e820_table_start = find_e820_area(start, max_pfn_mapped<<PAGE_SHIFT,
tables, PAGE_SIZE);
if (e820_table_start == -1UL)
Printed output:
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000ca000 - 00000000000cc000 (reserved)
BIOS-e820: 00000000000dc000 - 00000000000e4000 (reserved)
BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000003fef0000 (usable)
BIOS-e820: 000000003fef0000 - 000000003feff000 (ACPI data)
BIOS-e820: 000000003feff000 - 000000003ff00000 (ACPI NVS)
BIOS-e820: 000000003ff00000 - 0000000040000000 (usable)
BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000fffe0000 - 0000000100000000 (reserved)
******
kernel direct mapping tables up to 377fe000 @ 15000-1a000
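We can see:
e820_table_start = 0x15000
e820_table_top = 0x1a000
So the tables are not placed right behind the temporarily mapped page tables;
This raises another question: the start passed to find_e820_area is 0x{7,8}000, so why is the resulting e820_table_start so much larger?
Following the calls, find_e820_area()->find_early_area()->bad_addr(), bad_addr() is the only place where start gets changed: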
/* Check for already reserved areas */
static inline int __init bad_addr(u64 *addrp, u64 size, u64 align)
{
int i;
u64 addr = *addrp;
int changed = 0;
struct early_res *r;
again:
i = find_overlapped_early(addr, addr + size);
r = &early_res[i];
if (i < max_early_res && r->end) {
*addrp = addr = round_up(r->end, align);
changed = 1;
goto again;
}
return changed;
}
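This function checks the reserved-memory information stored in the global array early_res; if the candidate range overlaps a reservation, start is moved past it;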
Could the kernel image reservation be recorded in this early_res?
As usual, let's print it out!
diff --git a/kernel/early_res.c b/kernel/early_res.c
index 7bfae88..c6a4475 100644
--- a/kernel/early_res.c
+++ b/kernel/early_res.c
@@ -44,6 +44,17 @@ static int __init find_overlapped_early(u64 start, u64 end)
 return i;
 }

+static void __init early_res_print()
+{
+ int i;
+ struct early_res *r;
+
+ for (i = 0; i < max_early_res && early_res[i].end; i++) {
+ r = &early_res[i];
+ printk(KERN_DEBUG "early_res: %s: %llx-%llx\n", r->name, r->start, r->end);
+ }
+}
+
 /*
 * Drop the i-th range from the early reservation map,
 * by copying any higher ranges down one over it, and
@@ -290,7 +301,7 @@ void __init reserve_early(u64 start, u64 end, char *name)
 {
 if (start >= end)
 return;
-
+ printk(KERN_DEBUG "early_res: %s: %llx-%llx\n", name, start, end);
 __check_and_double_early_res(start, end);

 drop_overlaps_that_are_ok(start, end);
@@ -492,6 +503,7 @@ static inline int __init bad_addr(u64 *addrp, u64 size, u64 align)
 u64 addr = *addrp;
 int changed = 0;
 struct early_res *r;
+ early_res_print();
 again:
 i = find_overlapped_early(addr, addr + size);
 r = &early_res[i];
Each reservation is printed when it is made, and early_res is printed whenever bad_addr is called;
early_res: TEXT DATA BSS: 100000-55d0c4
early_res: RAMDISK: 377cc000-37ff0000
Linux version 2.6.35.7-default+ (root@lj) (gcc version 4.2.1 (SUSE Linux)) #3 SMP Mon Oct 17 22:43:57 EDT 2011
BIOS-provided physical RAM map:
There is TEXT DATA BSS! It is printed before the banner, that is, before start_kernel is entered. Looking up the references to reserve_early quickly shows that the reservation is made in i386_start_kernel():
void __init i386_start_kernel(void)
{
***
reserve_early(__pa_symbol(&_text), __pa_symbol(&__bss_stop), "TEXT DATA BSS");
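All right, that settles this question.
e820.map before find_e820_area():
wjcdx: 0000000000000000 - 0000000000010000 (reserved)
K8
The Bootmem Allocator manages memory before the buddy system and the slab allocator take over;
CONFIG_K8_NUMA: K8 is an AMD CPU architecture.
TODO
Two questions that need study:
Below, we first study the bootmem/buddy system/slab allocator memory management mechanisms and the high-memory management mechanism, and then return to the two page table questions raised above.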
In setup_arch():
/*
Bootmem Allocator initialization in the wrong initmem_init
initmem_init() is defined in two places, numa_32.c and init_32.c. Running
$make arch/x86/kernel/setup.i
shows that initmem_init() comes from the header arch/x86/include/asm/page_types.h
# 40 "/home/wjcdx/linux/linux-2.6/arch/x86/include/asm/page_types.h" 2
and the file that includes that header is init_32.c.
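Well, the guess in the previous comment was wrong, though it paved the way for this one.
The real reason is that setup.i was generated from the latest (3.0+) code, whereas the code I was reading in Source Insight was 2.6.35.7; in the latest version both definitions are void initmem_init(void) :)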
highstart_pfn = max_low_pfn
vmalloc_start = highstart_pfn,I think.
e820_register_active_regions(0, 0, highend_pfn);
This adds the e820.map entries of type E820_RAM that lie between 0 and max_pfn to early_node_map.
setup_bootmem_allocator();
Initialization of the Bootmem Allocator starts here.
void __init setup_bootmem_allocator(void)
{
#ifndef CONFIG_NO_BOOTMEM
int nodeid;
unsigned long bootmap_size, bootmap;
/*
 * Initialize the boot-time allocator (with low memory only):
 */
bootmap_size = bootmem_bootmap_pages(max_low_pfn)<<PAGE_SHIFT;
bootmap = find_e820_area(0, max_pfn_mapped<<PAGE_SHIFT, bootmap_size,
PAGE_SIZE);
if (bootmap == -1L)
panic("Cannot find bootmem map of size %ld\n", bootmap_size);
reserve_early(bootmap, bootmap + bootmap_size, "BOOTMAP");
#endif
printk(KERN_INFO " mapped low ram: 0 - %08lx\n",
max_pfn_mapped<<PAGE_SHIFT);
printk(KERN_INFO " low ram: 0 - %08lx\n", max_low_pfn<<PAGE_SHIFT);
#ifndef CONFIG_NO_BOOTMEM
for_each_online_node(nodeid) {
unsigned long start_pfn, end_pfn;
#ifdef CONFIG_NEED_MULTIPLE_NODES
start_pfn = node_start_pfn[nodeid];
end_pfn = node_end_pfn[nodeid];
if (start_pfn > max_low_pfn)
continue;
if (end_pfn > max_low_pfn)
end_pfn = max_low_pfn;
#else
start_pfn = 0;
end_pfn = max_low_pfn;
#endif
bootmap = setup_node_bootmem(nodeid, start_pfn, end_pfn,
bootmap);
}
#endif
after_bootmem = 1;
}
||
\/
in setup_bootmem_allocator:
bootmap = setup_node_bootmem(nodeid, start_pfn, end_pfn,
bootmap);
||
\/
In setup_node_bootmem
bootmap_size = init_bootmem_node(NODE_DATA(nodeid),
bootmap >> PAGE_SHIFT,
start_pfn, end_pfn);
***
return bootmap + bootmap_size;
||
\/
return init_bootmem_core(pgdat->bdata, freepfn, startpfn, endpfn);
||
\/
In init_bootmem_core:
bdata->node_bootmem_map = phys_to_virt(PFN_PHYS(mapstart));
***
mapsize = bootmap_bytes(end - start);
memset(bdata->node_bootmem_map, 0xff, mapsize);
Questions about setup_bootmem_allocator:
bootmap_bytes
static unsigned long __init bootmap_bytes(unsigned long pages)
This is a side quest, so we will not linger; let us simply assume the bootmap has been initialized.
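PS: In the latest (3.0+) code, setup_bootmem_allocator in init_32.c contains only two printk calls, yet it still sets after_bootmem = 1;. How is memory allocated from then on? Directly use early_pages_start, or bootmem allocator?
reserve_bootmem & free_bootmem set and clear bits in the bitmap; the logic is not complicated and is not repeated here.
See also:
Dumping the memory state
In setup_arch:
This copies the memory reservation state stored in early_res into the Bootmem Allocator.
Overview of zone/buddy system/slab
In init_32.c::paging_init():
Reference: http://blog.chinaunix.net/space.php?uid=20543183&do=blog&id=1930810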
In start_kernel:
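Realizing how initmem_init is defined, prompted by the difficulty of analyzing NUMA
At this point things seemed stuck: going further ran into many NUMA-related data structures whose appearance felt abrupt. They had not been skipped; they simply had not come up before. So where are they initialized?
Tracing the boot process yielded nothing.
Tracing NODE_DATA leads to numa_32.c. That file contains the functions that deal with struct pglist_data, i.e. pg_data_t, and at the end a somewhat familiar initmem_init function, which also calls allocate_pgdat, the function that allocates pg_data. Yet according to the earlier analysis, the initmem_init() called from setup_arch is the one in init_32.c.
Could the earlier analysis have been wrong?
Looking at initmem_init: the function has several definitions but only one declaration, in page_types.h.
init_32.c includes page_types.h explicitly; could numa_32.c include it implicitly? Header inclusion is so tangled that it is hard to sort out.
$make arch/x86/mm/numa_32.i
shows that the declaration there also comes from page_types.h.
Now look again at the definition of initmem_init in init_32.c:
#ifndef CONFIG_NEED_MULTIPLE_NODES
It is only defined when CONFIG_NEED_MULTIPLE_NODES is not defined, so it seems that when CONFIG_NUMA is defined, the initmem_init in init_32.c is not compiled at all.
That clears things up.
Linux memory models
While reading references and configuring the kernel, I came across the Linux memory models: FLAT, SPARSE, DISCONTIG(discontiguous).
Let's look at initmem_init in numa_32.c, ignoring the sparse-related code because it is a relatively new feature.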
initmem_init under NUMA
numa_32.c::initmem_init:
static inline void get_memcfg_numa(void)
There are three methods for initializing the memory layout. The global variables initialized are
node_start_pfn/node_end_pfn, and e820_register_active_regions is called to register the discovered memory in early_node_map.
In calculate_numa_remap_pages, for every node, a map of size node_remap_size[nid] is allocated on that node itself,
The map is kept near the end physical page range that has already been registered.
In get_memcfg_numa, node_remap_size[nid] = (end_pfn - start_pfn + 1) * sizeof(struct page), while in calculate_numa_remap_pages, node_remap_size[nid] = node_remap_size[nid] + sizeof(pg_data_t), and node_remap_offset is initialized.
Although these node_remap areas are physically discontiguous, the offsets recorded in node_remap_offset are contiguous, which hints that the linear address space mapped later may be contiguous, most likely the kernel virtual address space.
numa_32.c::initmem_init():
for_each_online_node(nid) {
numa_32.c::initmem_init():
kva_pages = roundup(calculate_numa_remap_pages(), PTRS_PER_PTE);
The node remap areas, i.e. the kva, are mapped at the very top of low memory.
zone_sizes_init
Next we enter paging_init()->zone_sizes_init().
static void __init zone_sizes_init(void)
zone_sizes_init() simply initializes max_zone_pfns and calls free_area_init_nodes().
Allocation of struct page
The struct page descriptors needed for the page frames of each node have also already been allocated in initmem_init.
That statement is not quite accurate: initmem_init only reserves the space, although that can also be called allocation.
/**
This function first calls sort_node_map() to sort node_map and then initializes some global variables; find_movable_zones (not considered here); then
/* Initialise every node */
void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
/*
memmap_init() => memmap_init_zone():
/*
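Next is start_kernel()->build_all_zonelists(), which builds the zonelists. The logic is clear enough; with some understanding of how zonelists are used, the initialization here is easy to follow.
After that comes start_kernel()->mm_init(), which releases the page frames held by bootmem to the buddy system, initializes the slab allocator, and initializes the dynamic mapping (vmalloc) area. Each is outlined below.
slab allocator initialization
mm_init()->mem_init():
void __init mem_init(void)
free_all_bootmem() is not covered in detail either.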
void __init kmem_cache_init(void)
{
/* Bootstrap is tricky, because several objects are allocated
 * from caches that do not exist yet:
 * 1) initialize the cache_cache cache: it contains the struct
 *    kmem_cache structures of all caches, except cache_cache itself:
 *    cache_cache is statically allocated.
 *    Initially an __init data area is used for the head array and the
 *    kmem_list3 structures, it's replaced with a kmalloc allocated
 *    array at the end of the bootstrap.
 * 2) Create the first kmalloc cache.
 *    The struct kmem_cache for the new cache is allocated normally.
 *    An __init data area is used for the head array.
 * 3) Create the remaining kmalloc caches, with minimally sized
 *    head arrays.
 * 4) Replace the __init data head arrays for cache_cache and the first
 *    kmalloc cache with kmalloc allocated arrays.
 * 5) Replace the __init data for kmem_list3 for cache_cache and
 *    the other cache's with kmalloc allocated memory.
 * 6) Resize the head arrays of the kmalloc caches to their final sizes.
 */
}
start_kernel()
||
\/
void __init kmem_cache_init_late(void)
{
/* 6) resize the head arrays to their final sizes */
}
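The main structures of the slab allocator are the cache, the slab, and the object
FROM ULK3:
The system statically allocates a cache_cache (the name says it all) that holds the cache data structures used by the slab algorithm; it also statically allocates an array_cache and a kmem_list3 for use while the slab allocator is being initialized;
The initialization first allocates, from cache_cache, the caches for the array_cache and kmem_list3 types, then the caches with power-of-two sizes. That sets up the data structures the slab allocator itself needs. It then uses kmalloc to allocate array_cache and kmem_list3 instances that replace the statically allocated, temporary initarray_cache and initkmem_list3;
kmem_cache_init_late()->enable_cpucache()->do_tune_cpucache()->alloc_arraycache() allocates an array_cache for every CPU, because kmem_cache_init only allocated the array_cache for the first CPU.
vmalloc_init
void __init vmalloc_init(void)
Throughout the analysis of page table initialization, I could not find the code that initializes the page table entries for the linear address range of the dynamic mapping area. Here, at last, is the obviously named vmalloc_init, but the function is a little worrying: it clearly walks the vmap_area structures on the global vmlist, so when were vmap_area entries added to vmlist? Was something missed?
A careful search shows nothing was missed. The only way to really prove it is to build and install the kernel and check the state at run time.
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
The printed output shows that vmlist is indeed empty:
### vmalloc_init
Summary of memory allocation mechanisms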
ULK3::8.2.11. Local Caches of Free Slab Objects
The Linux 2.6 implementation of the slab allocator for multiprocessor systems differs from that of the original Solaris 2.4. To reduce spin lock contention among processors and to make better use of the hardware caches, each cache of the slab allocator includes a per-CPU data structure consisting of a small array of pointers to freed objects called the slab local cache. Most allocations and releases of slab objects affect the local cache only; the slab data structures get involved only when the local cache underflows or overflows.
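The methods used by the various allocators are all easy to understand and need no further discussion. Here is a rough summary of how the allocation mechanisms relate to one another:
kmalloc calls the slab allocator to allocate data structures from a slab. The slab allocator first tries the local cache; if the local cache has nothing, a new slab is allocated. Allocating a slab calls the zone allocator to obtain memory pages; each zone has its own buddy system, and the pages of a zone may in turn be spread across different NUMA nodes.
Of course, the alloc_pages family can call the zone allocator directly.
Process page tables
A process's struct mm has a field pgd_t * pgd that points to its page tables;
The threads of a process share its memory space and other resources;
When a process creates a new process, it copies the parent's page tables. Every process is a descendant of the init process, and the init process's page tables are the ones allocated and initialized at system boot.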