Category: LINUX

2008-09-27 14:43:58

Understanding the Linux Kernel Initcall Mechanism

Creating Dynamic Function-Pointer Call Tables

by

Abstract

While browsing through the Linux kernel, I came upon a technique for creating a segment of function pointers which can be called at a later time in the order that they are inserted into the segment. The kernel uses this mechanism to call device driver initialization routines at boot-up. I had a hard time finding clear information about what was going on and how it worked. After much googling (and some code doodling) I understand what is happening. This paper is an attempt at putting what I understand of this process down on paper so others may hopefully find it useful.

NOTE: This technique depends on, and requires the use of, the GNU compiler and linker tools. Additionally, the ELF binary executable format must be used. This is not a general-purpose ANSI-C compliant technique.

1. Introduction

Date and Version

The original version of this paper was started on the 26th of August 2003; the kernel code quoted referred to version 2.4.22 and the binutils utility came from binutils-1.4. This document was rechecked on the 11th of October, 2006. At that time small modifications were made to the text, all code was re-checked, and all output re-generated from the new code. The kernel being used was 2.6.18, binutils 2.15.94.0.2.2, and gcc 4.0.0.

Notation

All kernel files that are referenced in this paper are specified by a path name relative to the kernel's root directory. For example, the setup.c file for the PowerPC architecture would be given as: arch/ppc/kernel/setup.c. A specific function (e.g. early_init) within a given file is expressed as arch/ppc/kernel/setup.c:early_init() regardless of what parameters it accepts (if any) and what it returns (if anything).

Architecture

As far as I know, this mechanism is not architecture-dependent. I actually found it while tracing through the boot process of the PowerPC architecture, but my code tests were performed on an x86-based machine. The code that makes this work is in the kernel's init/ directory which, as I understand it, contains initialization code used by all architectures. The ability to use this mechanism depends on support provided by the GNU tools and the ELF executable format rather than on architecture-specific support.

Tools

Peering into the world of object files is made easier by the following tools:

  • objdump
  • nm
  • readelf

"objdump -t" output format

I often dump the symbols of a file using objdump -t. I can't seem to easily locate any documentation on the output format so I've included some quick notes here. A typical use of this tool would look something like the following:

[trevor]$ objdump -t add.o      

add.o: file format elf32-i386

SYMBOL TABLE:
00000000 l df *ABS* 00000000 add.c
00000000 l d .text 00000000
00000000 l d .data 00000000
00000000 l d .bss 00000000
00000000 l d .comment 00000000
00000000 g F .text 0000000b add

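Here is my understanding of the columns. This information comes from browsing through bfd/syms.c:bfd_print_symbol_vandf() (where "vandf" stands for "value and flags"). From left to right, each line gives the symbol's value (its address or offset), a group of flag characters, the section the symbol belongs to (*ABS* for absolute, *UND* for undefined, *COM* for common), the symbol's size, and the symbol's name. Among the flag characters, l and g mark local and global symbols, w a weak symbol, d a debugging symbol, f a file name, F a function, and O a data object.

The flags shown by objdump -t are part of a larger set of symbol attributes which are defined in bfd/bfd.h. The entire set of flags (or attributes) and their meanings are given below. NOTE: objdump -t doesn't try to display all the possible flags, only a subset such as those mentioned above.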

Definition  Format  Symbol               Description
0x00000             BSF_NO_FLAGS         Placeholder for no defined flags.
0x00001             BSF_LOCAL            The symbol has local scope (i.e. a static in C). VALUE(1) is this symbol's offset into the data section.
0x00002             BSF_GLOBAL           The symbol has global scope (i.e. initialized data in C). VALUE(2) is this symbol's offset into the data section.
BSF_GLOBAL          BSF_EXPORT           This symbol has global scope and is exported. Same as BSF_GLOBAL.
0x00008             BSF_DEBUGGING        The symbol is a debugging record. The VALUEs are arbitrary, unless BSF_DEBUGGING_RELOC is set.
0x00010     ELF     BSF_FUNCTION         Function entry point.
0x00020             BSF_KEEP             Used by the linker.
0x00040             BSF_KEEP_G           Used by the linker.
0x00080             BSF_WEAK             Weak global symbol. This symbol is overridable (without warning) by a regular global symbol of the same name.
0x00100     ELF     BSF_SECTION_SYM      This symbol points to a section.
0x00200             BSF_OLD_COMMON       This symbol used to be *COM*, but is now allocated.
0x00400     COFF    BSF_NOT_AT_END       This symbol appears where it is declared and not at the end of a section.
0x00800             BSF_CONSTRUCTOR      This symbol indicates the start of the constructor section.
0x01000             BSF_WARNING          The presence of this symbol acts to indicate that there is a warning on the next symbol.
0x02000             BSF_INDIRECT         This symbol is an indirect pointer to the symbol with the same name as the next symbol.
0x04000     ELF     BSF_FILE             This symbol contains a filename.
0x08000     ELF     BSF_DYNAMIC          This symbol is associated with dynamic linking.
0x10000     ELF     BSF_OBJECT           This symbol denotes a data object.
0x20000             BSF_DEBUGGING_RELOC  This is a debugging symbol. VALUE(1) is the offset into the data section. BSF_DEBUGGING should be set too.
0x40000     ELF     BSF_THREAD_LOCAL     This symbol is used for thread local storage.


2. Motivation

What motivated me to explore this? I was looking through the source code of the Linux kernel, trying to get my head around exactly how the kernel boots. Partway through, I came across some code I just couldn't figure out; it didn't make sense to me how it could work.

In init/main.c:do_basic_setup() is a call to do_initcalls() which is defined as:

static void __init do_initcalls(void)
{
    initcall_t *call;
    int count = preempt_count();

    for (call = __initcall_start; call < __initcall_end; call++) {
        char *msg = NULL;
        char msgbuf[40];
        int result;

        if (initcall_debug) {
            printk("Calling initcall 0x%p", *call);
            print_fn_descriptor_symbol(": %s()",
                    (unsigned long) *call);
            printk("\n");
        }

        result = (*call)();

        if (result && result != -ENODEV && initcall_debug) {
            sprintf(msgbuf, "error code %d", result);
            msg = msgbuf;
        }
        if (preempt_count() != count) {
            msg = "preemption imbalance";
            preempt_count() = count;
        }
        if (irqs_disabled()) {
            msg = "disabled interrupts";
            local_irq_enable();
        }
        if (msg) {
            printk(KERN_WARNING "initcall at 0x%p", *call);
            print_fn_descriptor_symbol(": %s()",
                    (unsigned long) *call);
            printk(": returned with %s\n", msg);
        }
    }

    /* Make sure there is no pending stuff from the initcall sequence */
    flush_scheduled_work();
}

Searching high and low for __initcall_start reveals that it is not defined in any *.c source file. Apart from its declaration and use in init/main.c, it appears only in the linker scripts (*.lds) for the various architectures.

[trevor@trevor linux-2.6.18]$ grep -r __initcall_start *
arch/alpha/kernel/vmlinux.lds.S: __initcall_start = .;
arch/arm/kernel/vmlinux.lds.S: __initcall_start = .;
arch/arm26/kernel/vmlinux-arm26-xip.lds.in: __initcall_start = .;
arch/arm26/kernel/vmlinux-arm26.lds.in: __initcall_start = .;
arch/cris/arch-v10/vmlinux.lds.S: __initcall_start = .;
arch/cris/arch-v32/vmlinux.lds.S: __initcall_start = .;
arch/frv/kernel/vmlinux.lds.S: __initcall_start = .;
arch/h8300/kernel/vmlinux.lds.S: ___initcall_start = .;
arch/i386/kernel/vmlinux.lds.S: __initcall_start = .;
arch/ia64/kernel/vmlinux.lds.S: __initcall_start = .;
arch/m32r/kernel/vmlinux.lds.S: __initcall_start = .;
arch/m68k/kernel/vmlinux-std.lds: __initcall_start = .;
arch/m68k/kernel/vmlinux-sun3.lds: __initcall_start = .;
arch/m68knommu/kernel/vmlinux.lds.S: __initcall_start = .;
arch/mips/kernel/vmlinux.lds.S: __initcall_start = .;
arch/parisc/kernel/vmlinux.lds.S: __initcall_start = .;
arch/powerpc/kernel/vmlinux.lds.S: __initcall_start = .;
arch/ppc/kernel/vmlinux.lds.S: __initcall_start = .;
arch/s390/kernel/vmlinux.lds.S: __initcall_start = .;
arch/sh/kernel/vmlinux.lds.S: __initcall_start = .;
arch/sh64/kernel/vmlinux.lds.S: __initcall_start = .;
arch/sparc/kernel/vmlinux.lds.S: __initcall_start = .;
arch/sparc64/kernel/vmlinux.lds.S: __initcall_start = .;
arch/v850/kernel/vmlinux.lds.S: ___initcall_start = . ; \
arch/x86_64/kernel/vmlinux.lds.S: __initcall_start = .;
arch/xtensa/kernel/vmlinux.lds.S: __initcall_start = .;
include/asm-um/common.lds.S: __initcall_start = .;
init/main.c:extern initcall_t __initcall_start[], __initcall_end[];
init/main.c: for (call = __initcall_start; call < __initcall_end; call++) {

3. Background Information

Before I start explaining what I found, I think it would be helpful to begin with a bit of a review of the ELF file format and how things get executed in Linux.

Executable File Formats

Compiling a C program produces machine code. But raw machine code isn't enough to allow the OS to run your program. The OS needs several pieces of meta-information about your program before it can load and run it, such as:

  • information necessary to allow dynamic linking
  • information about the size of your executable
  • how much executable code and how much data space is used by your application
  • how to lay out the application in memory
  • debugging information
  • ...and many many other things.

One way to solve this problem is to use a file format that contains not only the raw machine code but also all the required additional information. Many such file formats have been devised over the years. The ones I am familiar with include:

  • DOS .exe
    • A simple format consisting of a header (which contains all the required meta-information) plus the code itself. Very similar to most bitmap graphics file formats.
  • DOS .com
    • A "file format" that contains no meta-information whatsoever. This "format" is literally a raw dump of the machine code. The OS makes assumptions for any information it needs, and the programmer must follow these assumptions. In a sense the meta-information still exists; it just isn't stored anywhere in the file itself.
  • a.out
    • A fairly basic file format based on the notion of sections. Unfortunately, it was created before shared libraries became mainstream and cannot easily specify dynamic linking information. It also does not allow for an arbitrary number of sections.
  • ELF
    • A nicer format that expands on the ideas of the a.out format and adds additional features.

Linker

The linker (in our case, GNU ld) is responsible for taking the raw machine code from the compiler/assembler and creating a valid ELF file. ELF files consist of several different sections and have many different abilities. In other words, laying out an ELF file can be quite involved considering all the options that are available and all the information that needs to be stored. The linker, therefore, uses a script which helps guide how the output file is laid out. If you don't supply a script, there is an implicit default one provided for you.

The different sections and their attributes are used by, for example, the operating system's loader (when you want to execute an application). For example, some sections contain executable code that needs to be loaded into memory, other sections don't contain any data at all, but rather they instruct the loader to allocate some memory for the application to use (sometimes this memory even needs to be explicitly zeroed). Some sections aren't used by the loader at all, for example, sections that contain debugging information are of no use to the loader but are used by a debugger instead.

By default, executable code you write ends up in a section called .text, initialised data in a .data section, read-only data in .rodata, uninitialised data in .bss, and so on.

Compiler

All compilers have their own extensions which either extend the language in some way or give the programmer #pragma-like control of the environment; GCC is no exception. Of particular interest is the ability GCC gives the developer to specify the name of the section into which some object is placed. The section name is only one of numerous attributes that can belong to an object.

4. Simple Examples

Before/After main()

Specifying an attribute gives the compiler information about how an object is intended to be used, allowing it not only to optimize your code better but also to perform additional checks for you. In general (using gcc) attributes can be specified for functions, variables, and types. Full information can be found in the GCC documentation, in the subsections on attributes.

Unless you've stumbled across this before, you probably thought that the first line of your main() is the first line of code that gets executed when your executable is run. This isn't true. There are a number of functions that run before your main() gets called. Then, after your main() terminates, a number of additional clean-up routines are also called. Your main() is just one of several functions for the loader to run.

gcc allows you to specify functions it should call during the phase before main() is called as well as functions to call during the phase after main() is done. The following code demonstrates this and serves as an example of how to specify attributes on functions.

/*
 * Copyright (C) 2006 Trevor Woerner
 */

#include <stdio.h>

/* Ask gcc to run my_ctor() before main() and my_dtor() after it. */
void my_ctor (void) __attribute__ ((constructor));
void my_dtor (void) __attribute__ ((destructor));

void
my_ctor (void)
{
  printf ("hello before main()\n");
}

void
my_dtor (void)
{
  printf ("bye after main()\n");
}

int
main (void)
{
  printf ("hello\nbye\n");
  return 0;
}

Compiling and running the above yields:

[trevor@trevor code]$ make beforeafter
cc beforeafter.c -o beforeafter
[trevor@trevor code]$ ./beforeafter
hello before main()
hello
bye
bye after main()

Section and Object Layout

For this example we're going to build ./main composed of main.c and add.c:

/*
 * main.c
 * Copyright (C) 2006 Trevor Woerner
 */

#include <stdio.h>

int add (int, int);
int global_val;
int gval_init = 0;

int
main (void)
{
  int local_val = 25;
  global_val = 17;

  printf ("local_val: %d global_val: %d gval_init: %d\n",
          local_val, global_val, gval_init);
  printf ("%d + %d = %d\n", local_val, global_val,
          add (local_val, global_val));

  return 0;
}

/*
 * add.c
 * Copyright (C) 2006 Trevor Woerner
 */

int
add (int i, int j)
{
  return i+j;
}

By the way, the local_val, gval_init, and global_val were just added so we could see which sections they end up in.

Compiling is a simple process of:

[trevor@trevor code]$ gcc -c main.c
[trevor@trevor code]$ gcc -c add.c
[trevor@trevor code]$ gcc -o main main.o add.o

Here is a dump of the info from add.o:

[trevor@trevor code]$ objdump -t add.o

add.o: file format elf32-i386

SYMBOL TABLE:
00000000 l df *ABS* 00000000 add.c
00000000 l d .text 00000000 .text
00000000 l d .data 00000000 .data
00000000 l d .bss 00000000 .bss
00000000 l d .note.GNU-stack 00000000 .note.GNU-stack
00000000 l d .comment 00000000 .comment
00000000 g F .text 0000000b add

Here is a similar dump of main.o:

[trevor@trevor code]$ objdump -t main.o

main.o: file format elf32-i386

SYMBOL TABLE:
00000000 l df *ABS* 00000000 main.c
00000000 l d .text 00000000 .text
00000000 l d .data 00000000 .data
00000000 l d .bss 00000000 .bss
00000000 l d .rodata 00000000 .rodata
00000000 l d .note.GNU-stack 00000000 .note.GNU-stack
00000000 l d .comment 00000000 .comment
00000000 g O .bss 00000004 gval_init
00000000 g F .text 0000007d main
00000004 O *COM* 00000004 global_val
00000000 *UND* 00000000 printf
00000000 *UND* 00000000 add

Here is a dump of the final executable:

[trevor@trevor code]$ objdump -t main

main: file format elf32-i386

SYMBOL TABLE:
08048114 l d .interp 00000000 .interp
08048128 l d .note.ABI-tag 00000000 .note.ABI-tag
08048148 l d .hash 00000000 .hash
08048174 l d .dynsym 00000000 .dynsym
080481d4 l d .dynstr 00000000 .dynstr
08048234 l d .gnu.version 00000000 .gnu.version
08048240 l d .gnu.version_r 00000000 .gnu.version_r
08048260 l d .rel.dyn 00000000 .rel.dyn
08048268 l d .rel.plt 00000000 .rel.plt
08048280 l d .init 00000000 .init
08048298 l d .plt 00000000 .plt
080482d8 l d .text 00000000 .text
08048488 l d .fini 00000000 .fini
080484a4 l d .rodata 00000000 .rodata
080484ec l d .eh_frame 00000000 .eh_frame
080494f0 l d .ctors 00000000 .ctors
080494f8 l d .dtors 00000000 .dtors
08049500 l d .jcr 00000000 .jcr
08049504 l d .dynamic 00000000 .dynamic
080495cc l d .got 00000000 .got
080495d0 l d .got.plt 00000000 .got.plt
080495e8 l d .data 00000000 .data
080495f4 l d .bss 00000000 .bss
00000000 l d .comment 00000000 .comment
00000000 l d *ABS* 00000000 .shstrtab
00000000 l d *ABS* 00000000 .symtab
00000000 l d *ABS* 00000000 .strtab
080482fc l F .text 00000000 call_gmon_start
00000000 l df *ABS* 00000000 crtstuff.c
080494f0 l O .ctors 00000000 __CTOR_LIST__
080494f8 l O .dtors 00000000 __DTOR_LIST__
08049500 l O .jcr 00000000 __JCR_LIST__
080495f4 l O .bss 00000001 completed.4583
080495f0 l O .data 00000000 p.4582
08048320 l F .text 00000000 __do_global_dtors_aux
08048354 l F .text 00000000 frame_dummy
00000000 l df *ABS* 00000000 crtstuff.c
080494f4 l O .ctors 00000000 __CTOR_END__
080494fc l O .dtors 00000000 __DTOR_END__
080484ec l O .eh_frame 00000000 __FRAME_END__
08049500 l O .jcr 00000000 __JCR_END__
08048460 l F .text 00000000 __do_global_ctors_aux
00000000 l df *ABS* 00000000 main.c
00000000 l df *ABS* 00000000 add.c
08049504 g O .dynamic 00000000 _DYNAMIC
080495fc g O .bss 00000004 global_val
080484a4 g O .rodata 00000004 _fp_hw
080494f0 g *ABS* 00000000 .hidden __fini_array_end
080495ec g O .data 00000000 .hidden __dso_handle
08048458 g F .text 00000005 __libc_csu_fini
08048280 g F .init 00000000 _init
080483fc g F .text 0000000b add
080495f8 g O .bss 00000004 gval_init
080482d8 g F .text 00000000 _start
080494f0 g *ABS* 00000000 .hidden __fini_array_start
08048408 g F .text 0000004f __libc_csu_init
080495f4 g *ABS* 00000000 __bss_start
0804837c g F .text 0000007d main
00000000 F *UND* 00000187 __libc_start_main@@GLIBC_2.0
080494f0 g *ABS* 00000000 .hidden __init_array_end
080495e8 w .data 00000000 data_start
00000000 F *UND* 00000039 printf@@GLIBC_2.0
08048488 g F .fini 00000000 _fini
080494f0 g *ABS* 00000000 .hidden __preinit_array_end
080495f4 g *ABS* 00000000 _edata
080495d0 g O .got.plt 00000000 .hidden _GLOBAL_OFFSET_TABLE_
08049600 g *ABS* 00000000 _end
080494f0 g *ABS* 00000000 .hidden __init_array_start
080484a8 g O .rodata 00000004 _IO_stdin_used
080495e8 g .data 00000000 __data_start
00000000 w *UND* 00000000 _Jv_RegisterClasses
080494f0 g *ABS* 00000000 .hidden __preinit_array_start
00000000 w *UND* 00000000 __gmon_start__

The interesting thing for me is how such a small amount of code generates such a large number of sections! Notice how both of the *.o files contain their own .text, .data, and .bss sections. When they are combined into the final executable main, it contains just one of each such section (i.e. the linker makes no distinction about where the specific parts come from; they all get combined into one larger section of the same name).

If we want to know the linker script that was used (to find out how ld lays out all the sections), all we have to do is pass the --verbose flag to ld via gcc (like this: gcc -Wl,--verbose ...) and the linker script will be printed on standard output along with the other verbose messages. Here is the linker script that I get for this code:

/* Script for -z combreloc: combine and sort reloc sections */
OUTPUT_FORMAT("elf32-i386", "elf32-i386",
"elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)
SEARCH_DIR("/usr/i386-redhat-linux/lib"); SEARCH_DIR("/usr/local/lib"); SEARCH_DIR("/lib"); SEARCH_DIR("/usr/lib");
/* Do we need any of these for elf?
__DYNAMIC = 0; */
SECTIONS
{
/* Read-only sections, merged into text segment: */
PROVIDE (__executable_start = 0x08048000); . = 0x08048000 + SIZEOF_HEADERS;
.interp : { *(.interp) }
.hash : { *(.hash) }
.dynsym : { *(.dynsym) }
.dynstr : { *(.dynstr) }
.gnu.version : { *(.gnu.version) }
.gnu.version_d : { *(.gnu.version_d) }
.gnu.version_r : { *(.gnu.version_r) }
.rel.dyn :
{
*(.rel.init)
*(.rel.text .rel.text.* .rel.gnu.linkonce.t.*)
*(.rel.fini)
*(.rel.rodata .rel.rodata.* .rel.gnu.linkonce.r.*)
*(.rel.data.rel.ro*)
*(.rel.data .rel.data.* .rel.gnu.linkonce.d.*)
*(.rel.tdata .rel.tdata.* .rel.gnu.linkonce.td.*)
*(.rel.tbss .rel.tbss.* .rel.gnu.linkonce.tb.*)
*(.rel.ctors)
*(.rel.dtors)
*(.rel.got)
*(.rel.bss .rel.bss.* .rel.gnu.linkonce.b.*)
}
.rela.dyn :
{
*(.rela.init)
*(.rela.text .rela.text.* .rela.gnu.linkonce.t.*)
*(.rela.fini)
*(.rela.rodata .rela.rodata.* .rela.gnu.linkonce.r.*)
*(.rela.data .rela.data.* .rela.gnu.linkonce.d.*)
*(.rela.tdata .rela.tdata.* .rela.gnu.linkonce.td.*)
*(.rela.tbss .rela.tbss.* .rela.gnu.linkonce.tb.*)
*(.rela.ctors)
*(.rela.dtors)
*(.rela.got)
*(.rela.bss .rela.bss.* .rela.gnu.linkonce.b.*)
}
.rel.plt : { *(.rel.plt) }
.rela.plt : { *(.rela.plt) }
.init :
{
KEEP (*(.init))
} =0x90909090
.plt : { *(.plt) }
.text :
{
*(.text .stub .text.* .gnu.linkonce.t.*)
KEEP (*(.text.*personality*))
/* .gnu.warning sections are handled specially by elf32.em. */
*(.gnu.warning)
} =0x90909090
.fini :
{
KEEP (*(.fini))
} =0x90909090
PROVIDE (__etext = .);
PROVIDE (_etext = .);
PROVIDE (etext = .);
.rodata : { *(.rodata .rodata.* .gnu.linkonce.r.*) }
.rodata1 : { *(.rodata1) }
.eh_frame_hdr : { *(.eh_frame_hdr) }
.eh_frame : ONLY_IF_RO { KEEP (*(.eh_frame)) }
.gcc_except_table : ONLY_IF_RO { KEEP (*(.gcc_except_table)) *(.gcc_except_table.*) }
/* Adjust the address for the data segment. We want to adjust up to
the same address within the page on the next page up. */
. = ALIGN (0x1000) - ((0x1000 - .) & (0x1000 - 1)); . = DATA_SEGMENT_ALIGN (0x1000, 0x1000);
/* Exception handling */
.eh_frame : ONLY_IF_RW { KEEP (*(.eh_frame)) }
.gcc_except_table : ONLY_IF_RW { KEEP (*(.gcc_except_table)) *(.gcc_except_table.*) }
/* Thread Local Storage sections */
.tdata : { *(.tdata .tdata.* .gnu.linkonce.td.*) }
.tbss : { *(.tbss .tbss.* .gnu.linkonce.tb.*) *(.tcommon) }
/* Ensure the __preinit_array_start label is properly aligned. We
could instead move the label definition inside the section, but
the linker would then create the section even if it turns out to
be empty, which isn't pretty. */
. = ALIGN(32 / 8);
PROVIDE (__preinit_array_start = .);
.preinit_array : { KEEP (*(.preinit_array)) }
PROVIDE (__preinit_array_end = .);
PROVIDE (__init_array_start = .);
.init_array : { KEEP (*(.init_array)) }
PROVIDE (__init_array_end = .);
PROVIDE (__fini_array_start = .);
.fini_array : { KEEP (*(.fini_array)) }
PROVIDE (__fini_array_end = .);
.ctors :
{
/* gcc uses crtbegin.o to find the start of
the constructors, so we make sure it is
first. Because this is a wildcard, it
doesn't matter if the user does not
actually link against crtbegin.o; the
linker won't look for a file to match a
wildcard. The wildcard also means that it
doesn't matter which directory crtbegin.o
is in. */
KEEP (*crtbegin*.o(.ctors))
/* We don't want to include the .ctor section from
from the crtend.o file until after the sorted ctors.
The .ctor section from the crtend file contains the
end of ctors marker and it must be last */
KEEP (*(EXCLUDE_FILE (*crtend*.o ) .ctors))
KEEP (*(SORT(.ctors.*)))
KEEP (*(.ctors))
}
.dtors :
{
KEEP (*crtbegin*.o(.dtors))
KEEP (*(EXCLUDE_FILE (*crtend*.o ) .dtors))
KEEP (*(SORT(.dtors.*)))
KEEP (*(.dtors))
}
.jcr : { KEEP (*(.jcr)) }
.data.rel.ro : { *(.data.rel.ro.local) *(.data.rel.ro*) }
.dynamic : { *(.dynamic) }
.got : { *(.got) }
. = DATA_SEGMENT_RELRO_END (12, .);
.got.plt : { *(.got.plt) }
.data :
{
*(.data .data.* .gnu.linkonce.d.*)
KEEP (*(.gnu.linkonce.d.*personality*))
SORT(CONSTRUCTORS)
}
.data1 : { *(.data1) }
_edata = .;
PROVIDE (edata = .);
__bss_start = .;
.bss :
{
*(.dynbss)
*(.bss .bss.* .gnu.linkonce.b.*)
*(COMMON)
/* Align here to ensure that the .bss section occupies space up to
_end. Align after .bss to ensure correct alignment even if the
.bss section disappears because there are no input sections. */
. = ALIGN(32 / 8);
}
. = ALIGN(32 / 8);
_end = .;
PROVIDE (end = .);
. = DATA_SEGMENT_END (.);
/* Stabs debugging sections. */
.stab 0 : { *(.stab) }
.stabstr 0 : { *(.stabstr) }
.stab.excl 0 : { *(.stab.excl) }
.stab.exclstr 0 : { *(.stab.exclstr) }
.stab.index 0 : { *(.stab.index) }
.stab.indexstr 0 : { *(.stab.indexstr) }
.comment 0 : { *(.comment) }
/* DWARF debug sections.
Symbols in the DWARF debugging sections are relative to the beginning
of the section so we begin them at 0. */
/* DWARF 1 */
.debug 0 : { *(.debug) }
.line 0 : { *(.line) }
/* GNU DWARF 1 extensions */
.debug_srcinfo 0 : { *(.debug_srcinfo) }
.debug_sfnames 0 : { *(.debug_sfnames) }
/* DWARF 1.1 and DWARF 2 */
.debug_aranges 0 : { *(.debug_aranges) }
.debug_pubnames 0 : { *(.debug_pubnames) }
/* DWARF 2 */
.debug_info 0 : { *(.debug_info .gnu.linkonce.wi.*) }
.debug_abbrev 0 : { *(.debug_abbrev) }
.debug_line 0 : { *(.debug_line) }
.debug_frame 0 : { *(.debug_frame) }
.debug_str 0 : { *(.debug_str) }
.debug_loc 0 : { *(.debug_loc) }
.debug_macinfo 0 : { *(.debug_macinfo) }
/* SGI/MIPS DWARF 2 extensions */
.debug_weaknames 0 : { *(.debug_weaknames) }
.debug_funcnames 0 : { *(.debug_funcnames) }
.debug_typenames 0 : { *(.debug_typenames) }
.debug_varnames 0 : { *(.debug_varnames) }
/DISCARD/ : { *(.note.GNU-stack) }
}

This might look difficult to read, but it's not. As in C, text between /* and */ is a comment and is ignored. As in assembly notation, a period by itself (.) stands for the current value of the output location counter.

At the top of the file is a bunch of housekeeping stuff. Then it gives the SECTIONS command which indicates the start of the script that defines how the sections of the output ELF file are going to be laid out. As an example, let's look at the following lines of code which lay out the .text part of the image (i.e. the part where the executable code is placed):

.text :
{
  *(.text .stub .text.* .gnu.linkonce.t.*)
  KEEP (*(.text.*personality*))
  /* .gnu.warning sections are handled specially by elf32.em. */
  *(.gnu.warning)
} =0x90909090

This snippet says:

  1. Now I am going to lay out a .text section in the output file, at this point in the output.
  2. This section is going to be composed of all the .text, .stub, .text.*, and .gnu.linkonce.t.* sections from all input files I am given, in the order in which I encounter them (the * before the parenthesized list is a wildcard that selects which input files to consider).
  3. This is followed by all .gnu.warning sections I encounter from all input files.
  4. The =0x90909090 written at the end of the section's description tells me the fill pattern to use for any gaps between the input sections (mostly due to alignment constraints); 0x90 is the x86 NOP instruction.

Just playing around and experimenting some more, here is the objdump -t of the executable again, with most of the cruft removed, and sorted by address:

080482d8 g F .text 00000000 _start
080482d8 l d .text 00000000 .text
080482fc l F .text 00000000 call_gmon_start
08048320 l F .text 00000000 __do_global_dtors_aux
08048354 l F .text 00000000 frame_dummy
0804837c g F .text 0000007d main
080483fc g F .text 0000000b add
08048408 g F .text 0000004f __libc_csu_init
08048458 g F .text 00000005 __libc_csu_fini
08048460 l F .text 00000000 __do_global_ctors_aux

If I change the link command to:

[trevor@trevor code]$ gcc -o main add.o main.o

watch how the positions of the functions main() and add() change place in the executable image:

080482d8 g F .text 00000000 _start
080482d8 l d .text 00000000 .text
080482fc l F .text 00000000 call_gmon_start
08048320 l F .text 00000000 __do_global_dtors_aux
08048354 l F .text 00000000 frame_dummy
0804837c g F .text 0000000b add
08048388 g F .text 0000007d main
08048408 g F .text 0000004f __libc_csu_init
08048458 g F .text 00000005 __libc_csu_fini
08048460 l F .text 00000000 __do_global_ctors_aux

This happens because the linker script, when creating the .text section, does a wildcard match on all .text sections and joins them together into one single .text section in the order in which they are encountered. During the first compile we specified the order as main.o followed by add.o; therefore the symbols were placed in the executable starting with the symbols from main.o followed by the symbols from add.o. In the second case we specified the object files in the reverse order, therefore the symbols were stored in the executable in the reverse order too.


Putting Objects into their own ELF Sections

I'm going to start with the code that we saw before in the section on code layout and modify it a bit so that different parts will now be in their own ELF sections:

/*
 * Copyright (C) 2006 Trevor Woerner
 */

#include <stdio.h>

/* Place add() and the two globals into sections of our own naming. */
int add (int, int) __attribute__ ((section ("my_code_section")));
int global_val __attribute__ ((section ("my_data_section")));
int gval_init __attribute__ ((section ("my_data_section"))) = 29;

int
add (int i, int j)
{
  return i+j;
}

int
main (void)
{
  int local_val = 25;
  global_val = 17;

  printf ("local_val: %d global_val: %d gval_init: %d\n",
          local_val, global_val, gval_init);
  printf ("%d + %d = %d\n", local_val, global_val,
          add (local_val, global_val));

  return 0;
}

Now when we do an objdump -t on the result we get the following:

00000000 F *UND* 00000039 printf@@GLIBC_2.0
00000000 F *UND* 00000187 __libc_start_main@@GLIBC_2.0
00000000 w *UND* 00000000 _Jv_RegisterClasses
00000000 w *UND* 00000000 __gmon_start__
00000000 l d *ABS* 00000000 .shstrtab
00000000 l d *ABS* 00000000 .strtab
00000000 l d *ABS* 00000000 .symtab
00000000 l d .comment 00000000 .comment
00000000 l df *ABS* 00000000 crtstuff.c
00000000 l df *ABS* 00000000 crtstuff.c
00000000 l df *ABS* 00000000 new.c
08048114 l d .interp 00000000 .interp
08048128 l d .note.ABI-tag 00000000 .note.ABI-tag
08048148 l d .hash 00000000 .hash
08048174 l d .dynsym 00000000 .dynsym
080481d4 l d .dynstr 00000000 .dynstr
08048234 l d .gnu.version 00000000 .gnu.version
08048240 l d .gnu.version_r 00000000 .gnu.version_r
08048260 l d .rel.dyn 00000000 .rel.dyn
08048268 l d .rel.plt 00000000 .rel.plt
08048280 g F .init 00000000 _init
08048280 l d .init 00000000 .init
08048298 l d .plt 00000000 .plt
080482d8 g F .text 00000000 _start
080482d8 l d .text 00000000 .text
080482fc l F .text 00000000 call_gmon_start
08048320 l F .text 00000000 __do_global_dtors_aux
08048354 l F .text 00000000 frame_dummy
0804837c g F .text 0000007a main
080483f8 g F .text 0000004f __libc_csu_init
08048448 g F .text 00000005 __libc_csu_fini
08048450 l F .text 00000000 __do_global_ctors_aux
08048478 g *ABS* 00000000 __start_my_code_section
08048478 g F my_code_section 0000000b add
08048478 l d my_code_section 00000000 my_code_section
08048483 g *ABS* 00000000 __stop_my_code_section
08048484 g F .fini 00000000 _fini
08048484 l d .fini 00000000 .fini
080484a0 g O .rodata 00000004 _fp_hw
080484a0 l d .rodata 00000000 .rodata
080484a4 g O .rodata 00000004 _IO_stdin_used
080484e8 l O .eh_frame 00000000 __FRAME_END__
080484e8 l d .eh_frame 00000000 .eh_frame
080494ec g *ABS* 00000000 .hidden __fini_array_end
080494ec g *ABS* 00000000 .hidden __fini_array_start
080494ec g *ABS* 00000000 .hidden __init_array_end
080494ec g *ABS* 00000000 .hidden __init_array_start
080494ec g *ABS* 00000000 .hidden __preinit_array_end
080494ec g *ABS* 00000000 .hidden __preinit_array_start
080494ec l O .ctors 00000000 __CTOR_LIST__
080494ec l d .ctors 00000000 .ctors
080494f0 l O .ctors 00000000 __CTOR_END__
080494f4 l O .dtors 00000000 __DTOR_LIST__
080494f4 l d .dtors 00000000 .dtors
080494f8 l O .dtors 00000000 __DTOR_END__
080494fc l O .jcr 00000000 __JCR_END__
080494fc l O .jcr 00000000 __JCR_LIST__
080494fc l d .jcr 00000000 .jcr
08049500 g O .dynamic 00000000 _DYNAMIC
08049500 l d .dynamic 00000000 .dynamic
080495c8 l d .got 00000000 .got
080495cc g O .got.plt 00000000 .hidden _GLOBAL_OFFSET_TABLE_
080495cc l d .got.plt 00000000 .got.plt
080495e4 w .data 00000000 data_start
080495e4 g .data 00000000 __data_start
080495e4 l d .data 00000000 .data
080495e8 g O .data 00000000 .hidden __dso_handle
080495ec l O .data 00000000 p.4582
080495f0 g *ABS* 00000000 __start_my_data_section
080495f0 g O my_data_section 00000004 gval_init
080495f0 l d my_data_section 00000000 my_data_section
080495f4 g O my_data_section 00000004 global_val
080495f8 g *ABS* 00000000 __bss_start
080495f8 g *ABS* 00000000 __stop_my_data_section
080495f8 g *ABS* 00000000 _edata
080495f8 l O .bss 00000001 completed.4583
080495f8 l d .bss 00000000 .bss
080495fc g *ABS* 00000000 _end

Running the executable gives:

local_val: 25    global_val: 17    gval_init: 29
25 + 17 = 42

The first thing to note is that the executable works! (yay!) The second thing you should notice is the existence of new section names (my_code_section and my_data_section) in the executable image. You will also notice that these sections contain the objects that we placed in them.

...
08048478 g *ABS* 00000000 __start_my_code_section
08048478 g F my_code_section 0000000b add
08048478 l d my_code_section 00000000 my_code_section
08048483 g *ABS* 00000000 __stop_my_code_section
...
080495f0 g *ABS* 00000000 __start_my_data_section
080495f0 g O my_data_section 00000004 gval_init
080495f0 l d my_data_section 00000000 my_data_section
080495f4 g O my_data_section 00000004 global_val
080495f8 g *ABS* 00000000 __bss_start
080495f8 g *ABS* 00000000 __stop_my_data_section

Something else that is very worthy of note is the fact that ld has been kind enough to add a couple of global absolute symbols which delimit our newly-defined sections without us needing to ask it to: __start_my_code_section/__stop_my_code_section and __start_my_data_section/__stop_my_data_section. Notice how __start_my_code_section has the same address as our add() function and that __start_my_data_section has the same address as our gval_init.
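As a small illustration of how those linker-provided symbols can be used from C, here is a minimal sketch. It assumes GNU ld and an ELF target, and, unlike the program above, it initializes both variables so that no main() is needed to set them up. The helper names my_data_count() and my_data_sum() are made up for this example.

```c
#include <stddef.h>

/* Two ints placed in a custom section. The section name must be a valid
 * C identifier for GNU ld to generate the __start_/__stop_ symbols. */
int gval_init  __attribute__ ((section ("my_data_section"))) = 29;
int global_val __attribute__ ((section ("my_data_section"))) = 17;

/* Provided by GNU ld at link time; not defined anywhere in the C code. */
extern int __start_my_data_section[];
extern int __stop_my_data_section[];

/* Number of ints stored in the section (hypothetical helper). */
size_t
my_data_count (void)
{
    return (size_t) (__stop_my_data_section - __start_my_data_section);
}

/* Sum of every int in the section, in whatever order they were laid out
 * (hypothetical helper). */
int
my_data_sum (void)
{
    const int *p;
    int sum = 0;

    for (p = __start_my_data_section; p < __stop_my_data_section; ++p)
        sum += *p;
    return sum;
}
```

The code never names the two variables when walking the section; it only relies on the start and stop addresses, which is exactly the property the initcall mechanism depends on.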

You may have just asked yourself: "In the generated my_data_section above, why did the gval_init object come first?". Having a look at the generated assembly (gcc -S) helps us to investigate this question:

.globl gval_init
.section my_data_section,"aw",@progbits
.align 4
.type gval_init, @object
.size gval_init, 4
gval_init:
.long 29
.section my_code_section,"ax",@progbits
.globl add
.type add, @function
add:
pushl %ebp
movl %esp, %ebp
movl 12(%ebp), %eax
addl 8(%ebp), %eax
leave
ret
.size add, .-add
.section .rodata
.align 4
.LC0:
.string "local_val: %d global_val: %d gval_init: %d\n"
.LC1:
.string "%d + %d = %d\n"
.text
.globl main
.type main, @function
main:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
andl $-16, %esp
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
subl %eax, %esp
movl $25, -4(%ebp)
movl $17, global_val
movl gval_init, %eax
movl global_val, %edx
pushl %eax
pushl %edx
pushl -4(%ebp)
pushl $.LC0
call printf
addl $16, %esp
movl global_val, %eax
pushl %eax
pushl -4(%ebp)
call add
addl $8, %esp
movl global_val, %edx
pushl %eax
pushl %edx
pushl -4(%ebp)
pushl $.LC1
call printf
addl $16, %esp
movl $0, %eax
leave
ret
.size main, .-main
.globl global_val
.section my_data_section
.align 4
.type global_val, @object
.size global_val, 4
global_val:
.zero 4

Notice how the two variables are emitted under two separate .section directives (both of which name my_data_section). Why did this happen? I'm not 100% sure, but the gcc sub-program that converts the C code into assembly set up two separate section blocks for the two variables; because both blocks name the same section, the assembler appends the second to the first and the variables end up together. I don't know why the initialized global ended up at the top of the assembly and the non-initialized one ended up at the bottom. Perhaps this is something to explore some other day.

Basically, the answer to the above question "why did gval_init end up first?" is that gcc separated them that way. If we make them both the same kind of global variable (below, both uninitialized) we'll see that gcc emits only one section directive for both of them, and that they appear in our section in the order in which they're found in the source code:

code:
int global_val __attribute__ ((section ("my_data_section")));
int gval_init __attribute__ ((section ("my_data_section")));

assembly: (at the bottom of file)
.globl global_val
.section my_data_section,"aw",@progbits
.align 4
.type global_val, @object
.size global_val, 4
global_val:
.zero 4
.globl gval_init
.align 4
.type gval_init, @object
.size gval_init, 4
gval_init:
.zero 4

objdump -t:
080495f0 g *ABS* 00000000 __start_my_data_section
080495f0 g O my_data_section 00000004 global_val
080495f0 l d my_data_section 00000000 my_data_section
080495f4 g O my_data_section 00000004 gval_init
080495f8 g *ABS* 00000000 __bss_start
080495f8 g *ABS* 00000000 __stop_my_data_section

code:
int gval_init __attribute__ ((section ("my_data_section")));
int global_val __attribute__ ((section ("my_data_section")));

assembly:
.globl gval_init
.section my_data_section,"aw",@progbits
.align 4
.type gval_init, @object
.size gval_init, 4
gval_init:
.zero 4
.globl global_val
.align 4
.type global_val, @object
.size global_val, 4
global_val:
.zero 4

objdump -t:
080495f0 g *ABS* 00000000 __start_my_data_section
080495f0 g O my_data_section 00000004 gval_init
080495f0 l d my_data_section 00000000 my_data_section
080495f4 g O my_data_section 00000004 global_val
080495f8 g *ABS* 00000000 __bss_start
080495f8 g *ABS* 00000000 __stop_my_data_section

5.How It Works

Armed with all of the above information, we're now ready to understand how the Linux kernel's initcall mechanism works. In fact, if you've understood most of what has been said up to this point, you already understand how it works; you might want to stop reading now and explore it on your own!

When you write a Linux kernel device driver there is a simple template that you follow. Following this template, together with some entries into the build system, a user can compile your driver either into the kernel or as a loadable module. All drivers, when loaded, have an opportunity to run a one-time initialization function. After this function is called it will never be called again for the duration of the time your driver is loaded. If your driver is used as a module, this one-time initialization function will be called when the driver is loaded. If your driver is compiled into the kernel, this one-time function is called as the system boots up. Having a kernel that has a fair amount of memory used by functions that are called once as the machine is brought up and will never be called again is a considerable waste. Therefore the kernel developers have arranged it such that all this code is put into its own ELF segment which is then tossed away once the machine is up and running (and has passed the initialization phase).
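The idea of fencing start-up code off into its own region can be tried in user space. The following is only a sketch, assuming GNU ld on an ELF target where a function pointer holds the function's code address (true on x86). The names setup_once(), run_often(), is_init_code() and init_is_separate() are invented for the example, and nothing here actually frees memory; it only shows that the region's bounds are known, which is what makes reclaiming it possible.

```c
#include <stdint.h>

#define __init __attribute__ ((section ("code_segment")))

/* Start-up-only code goes into code_segment ... */
static int __init
setup_once (void)
{
    return 0;
}

/* ... while ordinary code stays in .text. */
static int
run_often (void)
{
    return 1;
}

/* Bounds of code_segment, generated by GNU ld. */
extern char __start_code_segment[];
extern char __stop_code_segment[];

/* Non-zero if fn lives inside the init-only region. */
int
is_init_code (int (*fn)(void))
{
    uintptr_t addr = (uintptr_t) fn;

    return addr >= (uintptr_t) __start_code_segment
        && addr <  (uintptr_t) __stop_code_segment;
}

/* Non-zero if the two functions really ended up in different regions. */
int
init_is_separate (void)
{
    return is_init_code (setup_once) && !is_init_code (run_often);
}
```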

Dumping a whole bunch of code into a separate segment at compile time is a nice idea, but how do you then call all those functions at run time? The functions aren't all the same length, and it wouldn't be a very productive idea to force them all to be! Therefore it isn't possible to step through the code segment, calling functions as you go along. Although the function definitions themselves aren't the same length, luckily pointers to functions are all the same length (on the same system), so we can build a table of pointers to all the initialization functions and step through this table calling each one in turn. Since this table is also only needed at initialization time, it makes sense to put the table of function pointers into its own segment as well, so that it too can be reclaimed after the initialization phase is complete.
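Before any linker tricks are involved, the "table of pointers" idea can be sketched with an ordinary C array. This is only an illustration: the function names, the call log, and run_init_table() are made up for the example, and the table is built by hand rather than by the linker.

```c
#include <stddef.h>

typedef int (*initcall_t)(void);

#define MAX_LOG 8

int call_log[MAX_LOG];   /* ids of the functions, in the order they ran */
int calls_made;          /* how many init functions have run so far     */

static void
note_call (int id)
{
    if (calls_made < MAX_LOG)
        call_log[calls_made] = id;
    ++calls_made;
}

/* Two init functions of clearly different code size. */
static int
init_short (void)
{
    note_call (1);
    return 0;
}

static int
init_long (void)
{
    int i, sum = 0;

    for (i = 1; i <= 10; ++i)
        sum += i;
    note_call (2);
    return sum > 0 ? 0 : -1;
}

/* The functions differ in length, but every table entry has the same size. */
static const initcall_t init_table[] = { init_short, init_long };

/* Call every entry in turn; return how many reported failure. */
int
run_init_table (void)
{
    size_t i, n = sizeof init_table / sizeof init_table[0];
    int failures = 0;

    for (i = 0; i < n; ++i)
        if (init_table[i] () != 0)
            ++failures;
    return failures;
}
```

The drawback of this hand-written table is that somebody has to maintain the array initializer; the section trick described next lets each driver add its own entry without touching a central list.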

Notice that the above trick of putting the initialization code into one segment and the initialization function pointer call table into another segment (both of which can be released once the machine is up and running) is only used when a device driver is compiled into the kernel. If the device driver is compiled as a module then the initialization code is handled differently.

The decision as to whether to compile something into the kernel or as a module is made not at code-writing time by the device driver writer, but at kernel configuration and build time, sometimes by someone other than the device driver writer. It is important to try to use the same code for both situations, and it makes a lot of sense to make these things very easy to handle and code for the person writing the device driver. So how are these two situations handled? By writing a bunch of macros and getting the programmers to follow a template.
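To give a feel for how one macro can serve both situations, here is a simplified user-space sketch. It is not the kernel's actual code: MY_MODULE is a hypothetical switch standing in for the kernel's build-time decision, my_driver_init() is an invented driver function, and the real kernel macros do more than this.

```c
typedef int (*initcall_t)(void);

#ifdef MY_MODULE   /* hypothetical switch: "build me as a module" */
/* Module build: make the init function reachable under one well-known
 * name that a loader could look up and call. */
#define module_init(fn) \
    int init_module (void) __attribute__ ((alias (#fn)))
#else
/* Built-in build: drop a pointer to the function into a table section
 * that start-up code walks once. */
#define module_init(fn) \
    static initcall_t __initcall_##fn \
    __attribute__ ((used, section ("function_ptrs"))) = fn
#endif

/* What the driver author writes is identical in both builds. */
static int
my_driver_init (void)
{
    return 0;
}

module_init (my_driver_init);
```

The driver source never mentions which build it is part of; only the definition of module_init changes.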

I have distilled the Linux device driver writing template for a very simple driver into the following code. I have found and expanded the macros for the situation where we want to create a driver that is built into the Linux kernel. Also note that if you want to write your own device drivers and are just learning, this is not what your code would look like at all since device drivers do not contain a main()! I wrote this code in such a way so that it uses the same ideas and roughly the same code as the kernel, but in such a way that it could be played with as a regular user as code that isn't a device driver.

/*
 * Copyright (C) 2006 Trevor Woerner
 */

#include <stdio.h>

typedef int (*initcall_t)(void);

/* Defined by the linker script, not by any C source file. */
extern initcall_t __initcall_start, __initcall_end;

#define __initcall(fn) \
    static initcall_t __initcall_##fn __init_call = fn
#define __init_call __attribute__ ((unused,__section__ ("function_ptrs")))
#define module_init(x) __initcall(x);

#define __init __attribute__ ((__section__ ("code_segment")))

static int __init
my_init1 (void)
{
    printf ("my_init () #1\n");
    return 0;
}

static int __init
my_init2 (void)
{
    printf ("my_init () #2\n");
    return 0;
}

module_init (my_init1);
module_init (my_init2);

void
do_initcalls (void)
{
    initcall_t *call_p;

    /* A for loop (rather than do/while) is also safe if the table is empty. */
    for (call_p = &__initcall_start; call_p < &__initcall_end; ++call_p) {
        fprintf (stderr, "call_p: %p\n", (void *) call_p);
        (*call_p)();
    }
}

int
main (void)
{
    fprintf (stderr, "in main()\n");
    do_initcalls ();
    return 0;
}

Let's examine these #define's closely:

1. module_init(x) (expands to __initcall(fn))

#define __initcall(fn) \
static initcall_t __initcall_##fn __init_call = fn
#define __init_call __attribute__ ((unused,__section__ ("function_ptrs")))
#define module_init(x) __initcall(x);

is a macro that:

  • takes a function name
  • defines a variable whose name is the concatenation of the string "__initcall_" and the function's name
  • of type initcall_t (i.e. a function pointer)
  • which carries the attributes from the expansion of the __init_call macro (which basically says to put this object (a function pointer) into its own segment called function_ptrs)
  • which is assigned the value of the function's address

This macro could be shortened to:

#define module_init(fn) \
static initcall_t __initcall_##fn __attribute__ ((section ("function_ptrs"))) = fn

with no loss of generality (that I am aware of).
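Spelled out by hand, the result of expanding module_init (my_init1) with the macros described above looks roughly like the declaration at the bottom of this sketch. Only the comments are mine; the rest follows the macros from the listing.

```c
#include <stdio.h>

typedef int (*initcall_t)(void);

static int
my_init1 (void)
{
    printf ("my_init () #1\n");
    return 0;
}

/* Hand-expanded form of:  module_init (my_init1);
 *   __initcall_##fn  ->  __initcall_my_init1   (token pasting)
 *   __init_call      ->  the attribute list    (ordinary macro expansion)
 * "unused" only silences the "defined but not used" warning; the
 * section attribute is what places the pointer in function_ptrs. */
static initcall_t __initcall_my_init1
    __attribute__ ((unused, __section__ ("function_ptrs"))) = my_init1;
```

Nothing in the C code ever refers to __initcall_my_init1 by name; it exists purely so that its value lands in the function_ptrs table.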

2. __init

is a macro that tells the compiler to put every object marked with it into its own segment called code_segment

Compiling this code we get... an error:

[trevor@trevor code]$ make initcalls   
cc initcalls.c -o initcalls
/tmp/ccG4XFSM.o(.text+0x9): In function `do_initcalls':
initcalls.c: undefined reference to `__initcall_start'
/tmp/ccG4XFSM.o(.text+0x36):initcalls.c: undefined reference to `__initcall_end'
collect2: ld returned 1 exit status
make: *** [initcalls] Error 1

Oh yeah, that's right: those are the symbols that don't appear anywhere in the code, only in the linker script. That's what got me started on all this in the first place! A linker script is used to make this all work. To be honest, I'm not sure why the kernel developers don't take advantage of the fact that the GNU linker will give you those start and stop symbols for free, but there's probably a good reason (or maybe not).
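For comparison, here is a sketch of the same program written against the symbols that GNU ld generates on its own, so that no custom linker script is needed. This is my variation, not what the kernel does. It assumes GNU ld (or a compatible linker) on an ELF target, uses the "used" attribute instead of "unused" so that an optimizing compiler keeps the pointers, and records calls in the invented init_order[] array instead of printing.

```c
typedef int (*initcall_t)(void);

/* Same idea as module_init above, but "used" keeps the pointer alive
 * even though nothing refers to it by name. */
#define module_init(fn) \
    static initcall_t __initcall_##fn \
    __attribute__ ((used, section ("function_ptrs"))) = fn

int init_order[2];   /* which init function ran first, which second */
int init_count;      /* how many init functions have run            */

static int
my_init1 (void)
{
    if (init_count < 2)
        init_order[init_count] = 1;
    ++init_count;
    return 0;
}

static int
my_init2 (void)
{
    if (init_count < 2)
        init_order[init_count] = 2;
    ++init_count;
    return 0;
}

module_init (my_init1);
module_init (my_init2);

/* Generated by GNU ld because "function_ptrs" is a valid C identifier. */
extern initcall_t __start_function_ptrs[];
extern initcall_t __stop_function_ptrs[];

void
do_initcalls (void)
{
    initcall_t *call_p;

    for (call_p = __start_function_ptrs; call_p < __stop_function_ptrs; ++call_p)
        (*call_p) ();
}
```

One caveat: ld only generates __start_/__stop_ symbols for section names that are valid C identifiers, so a section whose name contains a dot would not get them. Also, without a linker script nothing marks these sections as discardable, so this variant shows the call table but not the memory reclamation.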

Trying to create a valid linker script by hand from scratch would be a nice exercise, but not something I have the time to investigate. So instead I'll get the linker to tell me what its default linker script is (ld --verbose prints it) and modify that to generate my required linker script. Following the lead of the kernel's linker scripts I have added the following lines to the linker script:

__initcall_start = .;
function_ptrs : { *(function_ptrs) }
__initcall_end = .;
code_segment : { *(code_segment) }

Which results in:

[trevor@trevor code]$ gcc -Tlinker.lds -o initcalls initcalls.c 
[trevor@trevor code]$ ./initcalls
in main()
call_p: 0x804850c
my_init () #1
call_p: 0x8048510
my_init () #2

It works!

The relevant objdump -t looks like:

0804850c g *ABS* 00000000 __initcall_start
0804850c l O function_ptrs 00000004 __initcall_my_init1
0804850c l d function_ptrs 00000000 function_ptrs
08048510 l O function_ptrs 00000004 __initcall_my_init2
08048514 g *ABS* 00000000 __initcall_end
08048514 l F code_segment 0000001d my_init1
08048514 l d code_segment 00000000 code_segment
08048531 l F code_segment 0000001d my_init2

Notice how, if we rearrange the following lines in the source:

module_init (my_init2);
module_init (my_init1);

The output becomes:

[trevor@trevor code]$ gcc -Tlinker.lds -o initcalls initcalls.c 
[trevor@trevor code]$ ./initcalls
in main()
call_p: 0x804850c
my_init () #2
call_p: 0x8048510
my_init () #1

and

0804850c g *ABS* 00000000 __initcall_start
0804850c l O function_ptrs 00000004 __initcall_my_init2
0804850c l d function_ptrs 00000000 function_ptrs
08048510 l O function_ptrs 00000004 __initcall_my_init1
08048514 g *ABS* 00000000 __initcall_end
08048514 l F code_segment 0000001d my_init1
08048514 l d code_segment 00000000 code_segment
08048531 l F code_segment 0000001d my_init2

6.References

Here is a list of the documents that I found helpful while studying this issue. These links were valid and checked at the time this paper was put together. Google (or its cache) can always help you locate a copy if these links aren't valid anymore.

  • by Tigran Aivazian at The Linux Documentation Project.

  • ELF: From the Programmer's Perspective by Hongjiu Lu linked to from

  • The info page for "ld".
  • The GNU GCC Manual, specifically the parts about specifying attributes.

BTW:http://kernelnewbies.org/Documents/InitcallMechanism