Category: LINUX

2009-05-20 12:21:03

 
Foundations of Computer Systems

Chapter 20: Translating Assembly Code

Object file format
1. An object file contains the machine code and data for a program, plus the other information needed to link and load the program. The machine code and data are stored in three segments:
text: machine code instructions
data: initialized data (data that have an initial value)
bss: uninitialized data (data that have no initial value)
2. The segment names are historical. The bss segment holds data that is not statically initialized in the program, so the object file records only the amount of memory needed to hold it.
3. The other two segments contain information that must be copied from the file into memory before the program runs.
4. An object file also contains a symbol table and a patch list. These are explained below.
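
To make the role of the three segments concrete, here is a minimal Python sketch of an object file held in memory. The class name ObjectFile, its field names, and the sample sizes are assumptions made for illustration only; they are not a real file format.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectFile:
    """Hypothetical in-memory model of the object file described above."""
    text: bytes = b""                              # machine code instructions
    data: bytes = b""                              # initialized data, stored byte for byte
    bss_size: int = 0                              # uninitialized data: only its size is kept
    symbols: dict = field(default_factory=dict)    # name -> (segment, offset)
    patches: list = field(default_factory=list)    # (text offset, symbol name)

    def stored_bytes(self) -> int:
        # bss contributes no bytes to the file, only a size.
        return len(self.text) + len(self.data)

    def memory_bytes(self) -> int:
        # At run time all three segments need memory.
        return len(self.text) + len(self.data) + self.bss_size


# Sample sizes are made up: 8 bytes of text, a 6-byte string, 1024 bytes of bss.
obj = ObjectFile(text=bytes(8), data=b"HELLO\0", bss_size=1024)
print(obj.stored_bytes(), obj.memory_bytes())
```

The gap between the two numbers is exactly the bss size, which is why bss can be represented by a single count.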
 
 
assembly -> machine -> running program
We've learned about machine code, the binary instructions that a computer understands, and assembly code, the human-readable form of machine code.
How is assembly code converted into machine code, and ultimately into a running program? We need three programs: the assembler, the linker, and the loader.
              assembler                linker               loader
foo.s         --------->  foo.o        ------>  a.out       ------>  running program
source files              object files          executable


Assembler
The assembler converts one source (assembly code) file into one object file. It performs the following functions:
1. Translate assembly instructions into machine code instructions. This involves setting the bits properly in the machine code.
2. Implement synthetic instructions using one or more machine instructions.
3. Evaluate constant expressions. [addu $sp, $sp, 24+8] becomes [addu $sp, $sp, 32].
4. Organize data and text in memory. This is difficult because a source file can have multiple data and text segments: the assembler might not know the final address of a label when it translates an instruction that uses the label.
5. Labels are a form of symbol, that is, a symbolic name for an address.
 
 
Assembler - Algorithm
 
void assembler(file) {
    dotOFile = createFile();                 # dotOFile: the object file being produced
    syTab = new SymbolTable();               # syTab: the symbol table
    PC = 0;                                  # PC: the location counter
    for line = each line in file do {        # file: the source file
        if hasLabel(line) then
            syTab.insert(getLabel(line), PC);
        instr = translate(line, syTab, PC);  # instr: the translated machine code
        if instr != NULL then {
            write(dotOFile, instr);
            PC += sizeOf(instr);
        }
    }
}
 
 
Assembler - Forward References
loop: beq $t0, 4, done
      add $t0, $t0, 1
      b loop
done:
 
1. This program uses symbols (labels). Without knowing the value of done, the assembler can't translate the beq instruction.
2. The assembler must defer translating this instruction until the address of done has been determined.
3. There are two ways of solving the problem. Both first figure out the symbol addresses, then use them to generate the machine code.
 
 
Two-pass assembler
The first solution is to make two passes over the source file. This is called a two-pass assembler.
Pass 1: Compute segment sizes and starting addresses, and fill in the symbol table. The symbol table contains the name of each symbol and its address. When the assembler sees a symbol defined (such as loop in the first instruction above) it adds it to the symbol table.
Final symbol addresses are unknown during the pass, because the segment starting addresses are not yet fixed. At the end of the first pass the sizes of the segments are known, and the symbol addresses are computed and stored in the symbol table.

Pass 2: Translate instructions. If an instruction uses a symbol, find the symbol in the symbol table and use its address during the translation.
This works, but requires reading the source file twice, which can be slow.
 
Two-pass - Algorithm
void assembler(file) {
    syTab = passOne(file);
    rewind(file);
    passTwo(file, syTab);
}

SymbolTable passOne(file) {
    syTab = new SymbolTable();
    PC = 0;
    for line = each line in file do {
        if hasLabel(line) then
            syTab.insert(getLabel(line), PC);
        PC += bytesNeeded(line);             # advance for every line, labeled or not
    }
    return syTab;
}

void passTwo(file, syTab) {
    dotOFile = createFile();
    PC = 0;
    for line = each line in file do {
        instr = translate(line, syTab, PC);
        if instr != NULL then {
            write(dotOFile, instr);
            PC += sizeOf(instr);
        }
    }
}
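
The two passes can be tried out on the forward-reference example with a small Python sketch. It is a toy, not a real assembler: it assumes every instruction line occupies 4 bytes, and instead of producing machine code it only replaces label operands with their addresses. The helper names split_label, pass_one and pass_two are made up for this sketch.

```python
SOURCE = """\
loop: beq $t0, 4, done
      add $t0, $t0, 1
      b loop
done:
"""


def split_label(line):
    """Return (label or None, remaining instruction text)."""
    if ":" in line:
        label, rest = line.split(":", 1)
        return label.strip(), rest.strip()
    return None, line.strip()


def pass_one(lines):
    """Record the address of every label (assumes 4 bytes per instruction)."""
    symtab, pc = {}, 0
    for line in lines:
        label, instr = split_label(line)
        if label:
            symtab[label] = pc
        if instr:
            pc += 4
    return symtab


def pass_two(lines, symtab):
    """Emit (address, mnemonic, operands) with labels replaced by addresses."""
    out, pc = [], 0
    for line in lines:
        _, instr = split_label(line)
        if not instr:
            continue
        mnemonic, _, rest = instr.partition(" ")
        operands = [symtab.get(op.strip(), op.strip()) for op in rest.split(",")]
        out.append((pc, mnemonic, operands))
        pc += 4
    return out


lines = SOURCE.splitlines()
symtab = pass_one(lines)
program = pass_two(lines, symtab)
print(symtab)
for entry in program:
    print(entry)
```

Because pass_one has already seen done, pass_two can translate the beq on the first line even though the label is defined later in the file.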
 
 
Patch List
Another approach is to make one pass over the source file, partially translating the instructions, and then make a second pass over the object file to fix (patch) those instructions that use symbols. A patch list keeps track of which instructions need to be patched, and which symbol(s) each one uses.

Phase 1: Read the entire source file, fill in the symbol table, translate instructions (leaving unknown addresses blank), and generate the patch list. At the end of phase 1 the sizes and starting addresses of the segments are known, and the symbol addresses are computed and stored in the symbol table.
Phase 2: Process the patch list, and patch each instruction using the address of the appropriate symbol(s).
 
Patch List - Algorithm
void assembler(file) {
    dotOFile = createFile();
    (patchList,syTab) = passOne(file, dotOFile);
    passTwo(dotOFile, patchList, syTab);
}
 
passOne(file, dotOFile) {
    syTab = new SymbolTable();
    PC = 0;
    patchList = new List();
    for line = each line in file do {
        if hasLabel(line) then
            syTab.insert(getLabel(line), PC);
        instr = translate(line, syTab, PC);
        if instr != NULL then {
            if incomplete(line) then {
                patch = new Patch(instr, PC);
                patchList.add(patch);
            }
            write(dotOFile, instr);
            PC += sizeOf(instr);
        }
    }
    return (patchList, syTab);
}

void passTwo(dotOFile, patchList, syTab) {
    for patch = each element on patchList do {
        applyPatch(dotOFile, patch, syTab);
    }
}
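
The same toy example can be assembled in one pass over the source plus one pass over the patch list. This Python sketch makes several assumptions: each instruction is 4 bytes, only b and beq take a label, the label is always the last operand, and an "instruction" is a small list rather than a machine word. For simplicity it also patches every label use, including backward references whose address is already known.

```python
SOURCE = """\
loop: beq $t0, 4, done
      add $t0, $t0, 1
      b loop
done:
"""


def split_label(line):
    """Return (label or None, remaining instruction text)."""
    if ":" in line:
        label, rest = line.split(":", 1)
        return label.strip(), rest.strip()
    return None, line.strip()


def assemble(lines):
    symtab, patches, out, pc = {}, [], [], 0
    # Phase 1: translate, leaving label operands blank and recording a patch.
    for line in lines:
        label, instr = split_label(line)
        if label:
            symtab[label] = pc
        if not instr:
            continue
        mnemonic, _, rest = instr.partition(" ")
        operands = [op.strip() for op in rest.split(",")]
        if mnemonic in ("b", "beq"):
            patches.append((len(out), operands[-1]))  # (instruction index, symbol)
            operands[-1] = None                       # address not filled in yet
        out.append([pc, mnemonic, operands])
        pc += 4
    # Phase 2: walk the patch list and fill in each blank from the symbol table.
    for index, symbol in patches:
        out[index][2][-1] = symtab[symbol]
    return out, symtab, patches


out, symtab, patches = assemble(SOURCE.splitlines())
print(patches)
print(out)
```

The source is read only once; phase 2 touches just the two instructions named in the patch list.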
 
 
Patch List: Caveats
1. Symbol addresses aren't known until the end of phase 1. Store each symbol address in the symbol table as an offset from the beginning (base address) of its segment. During phase 2, add the offset to the segment base to get the symbol address.

2. Different instructions have different encodings, which means that patching is instruction-dependent:
(a) "lw $t0, foo" is translated into 0x8C080000 in the first pass. The last four hex digits (16 bits) should be patched with the address of foo when the patch list is processed.
(b) "jal foo" is translated into 0x0C000000 in the first pass. The last six hex digits plus the low two bits of the 'C' (26 bits total) should be patched with the word address of foo. The low 2 bits of foo's byte address are always zero (because instructions must be word-aligned), so they are dropped.

3. Either:
(a) Read the instruction to be patched from the object file, parse it, and patch it depending on what instruction it is, or
(b) Store the type of instruction in the patch list to avoid re-parsing the instruction, or
(c) Store generic patching information in the patch list (e.g. put the address into bits 0-15 of the instruction). This solution doesn't depend on the instruction encoding or instruction set, so it can easily be ported to a different architecture.
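
The two encodings in caveat 2 can be checked with a few lines of Python. The function names are made up for this sketch, and the address 0x78 used for foo in the lw case is only an example value. patch_bits is the "generic" form from caveat 3(c): it knows nothing about the instruction, only which bits to fill.

```python
def patch_bits(word, value, width):
    """Store `value` in the low `width` bits of a 32-bit instruction word."""
    mask = (1 << width) - 1
    return (word & ~mask & 0xFFFFFFFF) | (value & mask)


def patch_lw(word, address):
    # lw keeps a 16-bit field in bits 0-15.
    return patch_bits(word, address, 16)


def patch_jal(word, address):
    # jal keeps a 26-bit word address in bits 0-25, so the two low
    # (always zero) bits of the byte address are shifted out first.
    return patch_bits(word, address >> 2, 26)


print(hex(patch_lw(0x8C080000, 0x78)))    # foo assumed to be at 0x78
print(hex(patch_jal(0x0C000000, 0x34)))   # foo assumed to be at 0x34
```

With option (c), a patch list entry would only need to carry the width and, for jal, the shift, and the same patch_bits routine would serve both cases.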
 

Branch instructions
1. Branch instructions are PC-relative: they modify the PC by a relative amount, rather than setting it to a fixed address.
2. This is a good thing, because it means a branch to an instruction in the same segment works even if the segment's starting address changes.
3. Once PC-relative instructions have been patched, they don't need to be re-patched if the segment is relocated in memory.
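
A quick Python check shows why relocation leaves a branch untouched. The sketch measures the distance from the branch instruction to its target in instructions; the two segment base addresses are hypothetical, and the offsets 96 and 80 are simply sample positions inside one segment.

```python
def branch_offset(branch_addr, target_addr):
    """Distance from the branch to its target, counted in 4-byte instructions."""
    return (target_addr - branch_addr) // 4


old_base, new_base = 0x0, 0x400      # hypothetical segment starting addresses
branch, target = 96, 80              # offsets of the branch and its target in the segment

before = branch_offset(old_base + branch, old_base + target)
after = branch_offset(new_base + branch, new_base + target)
print(before, after)
```

The base address cancels out in the subtraction, so the value patched into the instruction is the same before and after the segment moves.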
 
 
Linker
The linker combines one or more object (.o) files into a single executable (a.out). To do this it must:
1. Combine like segments. All text segments from the object files are combined into one big text segment, and all data segments are combined into one big data segment.
2. Re-patch instructions. Segment starting addresses will change, so instructions will have to be re-patched. The symbol tables and patch lists stored in the object files are used for this.
3. Resolve external references. If one object file has an undefined reference to a symbol, the linker searches the global symbols defined in the other object files to find a match and patch the instruction. If no match is found, an error is reported.
4. Link in object files from specified libraries. A library is a collection of object files that are indexed by the global symbols they define. If an undefined reference is found in the index, the linker extracts the proper object file from the library and links it into the executable.

When patching instructions the linker must be careful to use the correct symbol table, as several object files might define and use the same symbol. There are two types of symbols:
Private: a private symbol is defined and used in one source file. It cannot be referenced by another source file.
Global: a global symbol is defined in one source file, but may be referenced by any source file.
Global definition: the definition of a global symbol. In MIPS assembly, the .globl directive indicates that the symbol is global. Symbols without this directive are private.
External reference: a reference to a global symbol that is not defined in the current source file.
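
The lookup rule for private and global symbols can be sketched in Python. The table layout (file name mapped to symbol, address and kind), the function name resolve, and the sample addresses are all assumptions made for this sketch.

```python
def resolve(symbol, current, tables):
    """Find the address `symbol` refers to when patching object file `current`.

    `tables` maps a file name to {symbol: (address, kind)}, where kind is
    "private" or "global".
    """
    own = tables[current]
    if symbol in own:
        return own[symbol][0]                 # the file's own definition is used
    for name, table in tables.items():
        if name == current:
            continue
        entry = table.get(symbol)
        if entry is not None and entry[1] == "global":
            return entry[0]                   # external reference resolved
    raise LookupError(f"undefined reference to {symbol}")


tables = {
    "main.o": {"main": (0x0, "private"), "done": (0x64, "private")},
    "print.o": {"print": (0x70, "global"), "newline": (0xA8, "private")},
}

print(hex(resolve("print", "main.o", tables)))   # global symbol, defined elsewhere
print(hex(resolve("done", "main.o", tables)))    # private symbol, own table
```

Asking for newline from main.o raises LookupError: the symbol exists in print.o, but it is private there and so invisible to other files.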
 
 
Loader
1. The loader is responsible for taking an executable (a.out) and turning it into a running program.
2. The a.out was linked assuming that the text segment starts at address 0, but what if there is already a program running at address 0? The new program will clearly have to start at a different address, meaning that many of its instructions have been patched with incorrect addresses and need to be patched again.
3. The loader uses the symbol tables and patch lists to re-patch the instructions. Once this is done, the text and data segments are copied into memory, and the loader starts the program running at the executable's starting address.
 
An Example
1. The following example converts an upper-case string to lower-case and prints it. The file main.s has two text segments and two data segments, and calls the routine print, which is defined in the file print.s.
2. The notation hi(symbol) means the upper 16 bits of the symbol's address.
3. lo(symbol) means the lower 16 bits.
4. xspim doesn't have syntax to express this.
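
The hi and lo notation amounts to a shift and a mask. In this Python sketch the sample address 0x10010078 is an arbitrary value chosen for illustration.

```python
def hi(address):
    """Upper 16 bits of a 32-bit address."""
    return (address >> 16) & 0xFFFF


def lo(address):
    """Lower 16 bits of a 32-bit address."""
    return address & 0xFFFF


address = 0x10010078                      # arbitrary sample address
print(hex(hi(address)), hex(lo(address)))

# What a lui/ori pair does with the two halves: shift the upper half back
# into place and OR in the lower half.
rebuilt = (hi(address) << 16) | lo(address)
print(hex(rebuilt))
```

This plain split matches the lui/ori pair. Load and store offsets are sign-extended by the hardware, which a real assembler has to compensate for when the lower half is 0x8000 or more; the sketch ignores that case.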
 
main.s
           .data
string:    .asciiz "HELLO"
           .text
main:
    subu   $sp, $sp, 24            # text+0
    sw     $fp, 0($sp)             # text+4
    addu   $fp, $sp, 24            # text+8
    sw     $ra, -20($fp)           # text+12
    lui    $a0, hi(string)         # text+16: la $a0, string
    ori    $a0, $a0, lo(string)    # text+20
    jal    tolower                 # text+24
    move   $a0, $v0                # text+28
    jal    print                   # text+32
    lw     $ra, -20($fp)           # text+36
    lw     $fp, -24($fp)           # text+40
    addu   $sp, $sp, 24            # text+44
    jr     $ra                     # text+48

           .data
           .align 2
offset:    .word 0x20
           .text
tolower:
    subu   $sp, $sp, 24            # text+52
    sw     $fp, 0($sp)             # text+56
    addu   $fp, $sp, 24            # text+60
    lui    $at, hi(offset)         # text+64: lw $t1, offset
    lw     $t1, lo(offset)($at)    # text+68
    lbu    $t0, 0($a0)             # text+72
    beqz   $t0, done               # text+76
loop:
    addu   $t0, $t0, $t1           # text+80
    sb     $t0, 0($a0)             # text+84
    addu   $a0, $a0, 1             # text+88
    lbu    $t0, 0($a0)             # text+92
    bnez   $t0, loop               # text+96
done:
    lw     $fp, -24($fp)           # text+100
    addu   $sp, $sp, 24            # text+104
    jr     $ra                     # text+108
 
 
Assemble main.s
1. First assemble main.s into main.o.
The assembler processes main.s one line at a time, and creates a symbol table and patch list.
 

main.s Symbol Table

Symbol     Address      Type
string     data+0       Private
main       text+0       Private
tolower    text+52      Private
print      undefined    External ref
offset     data+8       Private
loop       text+80      Private
done       text+100     Private

2. The value of offset is data+8 because the string string is stored at the beginning of the data segment. string has 6 characters (don't forget the 0!), but offset must be word-aligned, so it is placed at the next multiple of 4.
3. After all lines in main.s have been processed, the size of the data segment is known to be 12 bytes, and the text segment 112 bytes (28 instructions).
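
The data segment arithmetic can be replayed in Python. The helper name align is made up for this sketch; the byte counts come straight from the two .data sections of main.s.

```python
def align(offset, boundary):
    """Round `offset` up to the next multiple of `boundary`."""
    return (offset + boundary - 1) // boundary * boundary


pc = 0
string_at = pc
pc += len(b"HELLO\0")     # 6 bytes, including the terminating zero
pc = align(pc, 4)         # .align 2 asks for 2**2 = 4-byte alignment
offset_at = pc
pc += 4                   # .word 0x20 occupies one word
data_size = pc

text_size = 28 * 4        # 28 instructions of 4 bytes each
print(string_at, offset_at, data_size, text_size)
```

The two padding bytes between the string and offset are what push offset from data+6 to data+8.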

main.s Segment Info

Segment    Base Addr.    Size
Text       0x0           0x70 (112)
Data       0x70          0xC (12)

main.s Patch List

Instruction address    Symbol used
text+16                string
text+20                string
text+24                tolower
text+32                print
text+64                offset
text+68                offset
text+76*               done
text+96*               loop


4. The instructions tagged with `*' are branch instructions. They are patched with the difference, counted in instructions, between the symbol's address and the branch instruction's address. E.g., to branch backwards one instruction the branch is patched with -1.
5. A branch can be patched once the offset is known, even if the final addresses aren't. In the case of "bnez loop" the instruction can be patched in the first pass because loop is at offset 80 and the branch instruction is at offset 96. The difference is -16 bytes, or -4 instructions.
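
Here is how the two branch patches could be computed and stored, as a Python sketch. It follows this chapter's convention of measuring from the branch instruction itself (real MIPS hardware measures from the following instruction, which the sketch deliberately does not model). The function name patch_branch is made up; the unpatched words 0x15000000 and 0x11000000 are the encodings of "bnez $t0" and "beqz $t0" with an empty offset field.

```python
def patch_branch(word, branch_addr, target_addr):
    """Fill bits 0-15 with the distance to the target, counted in instructions."""
    distance = (target_addr - branch_addr) // 4
    # A negative distance is stored in 16-bit two's complement.
    return (word & 0xFFFF0000) | (distance & 0xFFFF)


bnez_loop = patch_branch(0x15000000, 96, 80)     # bnez $t0, loop
beqz_done = patch_branch(0x11000000, 76, 100)    # beqz $t0, done
print(hex(bnez_loop), hex(beqz_done))
```

The backward branch ends in FFFC, which is -4 in 16-bit two's complement; the forward branch simply ends in 6.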

Applying patches

Address patched    Symbol used    Symbol address     Comments
0x10 (text+16)     string         0x70 (data+0)      upper 16 bits
0x14 (text+20)     string         0x70 (data+0)      lower 16 bits
0x18 (text+24)     tolower        0x34 (text+52)     address bits 27-2
0x20 (text+32)     print          undefined          external ref.
0x40 (text+64)     offset         0x78 (data+8)      upper 16 bits
0x44 (text+68)     offset         0x78 (data+8)      lower 16 bits
0x4C (text+76)     done           0x64 (text+100)    +6
0x60 (text+96)     loop           0x50 (text+80)     -4

Address    Before patch    After patch
0x10       0x3C040000      0x3C040000
0x14       0x34840000      0x34840070
0x18       0x0C000000      0x0C00000D
0x20       0x0C000000      0x0C000000
0x40       0x3C010000      0x3C010000
0x44       0x8C290000      0x8C290078
0x4C       0x11000000      0x11000006
0x60       0x15000000      0x1500FFFC

Assemble print.s
Now do the same for print.s, containing the print subroutine.
print.s

        .data
newline:.asciiz "\n"
        .text
        .globl print
print:
    subu    $sp, $sp, 24            # text+0
    sw      $fp, 0($sp)             # text+4
    addu    $fp, $sp, 24            # text+8
    li      $v0, 4                  # text+12
    syscall                         # text+16
    lui     $a0, hi(newline)        # text+20
    addu    $a0, $a0, lo(newline)   # text+24
    syscall                         # text+28
    lw      $fp, -24($fp)           # text+32
    addu    $sp, $sp, 24            # text+36
    jr      $ra                     # text+40
 

print.s Symbol Table

Symbol     Address    Type
newline    data+0     Private
print      text+0     Global

 

print.s Segment Info

Segment    Base Addr.    Size
Text       0x0           0x2C (44)
Data       0x2C          0x2 (2)

 

print.s Patch List

Instruction address    Symbol used
text+20                newline
text+24                newline

 

 
Applying patches

Address patched    Symbol used    Symbol address    Comments
0x14 (text+20)     newline        0x2C (data+0)     high 16 bits
0x18 (text+24)     newline        0x2C (data+0)     low 16 bits

 
 
Linking
1. The assembly process produces two object (.o) files, main.o and print.o. Each contains a text segment with the machine instructions, a data segment, a symbol table, and a patch list.
2. The linker combines the object files into an executable. It opens both files and computes the sizes and starting addresses of the combined segments.
3. Assume that the main.s text segment starts at address 0, followed by the print.s text segment, followed by the main.s data segment, and finally the print.s data segment:
 

Program segments

Segment         Starting Address    Shorthand
main.s text     0x0                 mtext
print.s text    0x70                ptext
main.s data     0x9C                mdata
print.s data    0xA8                pdata

 
4. Combining the segments has caused the starting addresses of three of them to change. I use the shorthand names mtext, ptext, mdata and pdata to represent the starting addresses of the segments. The patch lists are used to patch the instructions.
5. Note that the instruction at address 0x20 is the call to print that the assembler couldn't patch. The linker uses the global definition of print in print.o to resolve the reference.
 

Patching instructions from main.o

Address patched    Symbol used    Symbol address
0x10 (mtext+16)    string         0x9C (mdata+0)
0x14 (mtext+20)    string         0x9C (mdata+0)
0x18 (mtext+24)    tolower        0x34 (mtext+52)
0x20 (mtext+32)    print          0x70 (ptext+0)
0x40 (mtext+64)    offset         0xA4 (mdata+8)
0x44 (mtext+68)    offset         0xA4 (mdata+8)
0x60 (mtext+96)    loop           0x50 (mtext+80)

 
6. Also note that for private symbols, such as done, the linker uses the definition from the symbol table of the file it is patching.
 

Patching instructions from print.o

Address patched    Symbol used    Symbol address
0x84 (ptext+20)    newline        0xA8 (pdata+0)
0x88 (ptext+24)    newline        0xA8 (pdata+0)
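
The segment starting addresses used in the linking step follow mechanically from the segment sizes. This Python sketch assumes the layout described in the text (all text segments first, then all data segments, starting at address 0); the function name layout and the dictionary shapes are made up for illustration.

```python
def layout(objects):
    """Place every text segment, then every data segment, starting at address 0."""
    bases, address = {}, 0
    for segment in ("text", "data"):
        for name, sizes in objects.items():
            bases[(name, segment)] = address
            address += sizes[segment]
    return bases


# Segment sizes taken from the two Segment Info tables.
objects = {
    "main": {"text": 0x70, "data": 0xC},
    "print": {"text": 0x2C, "data": 0x2},
}
bases = layout(objects)

# Symbol address = new segment base + offset from the file's symbol table.
string_addr = bases[("main", "data")] + 0
offset_addr = bases[("main", "data")] + 8
print_addr = bases[("print", "text")] + 0
newline_addr = bases[("print", "data")] + 0
for key, base in bases.items():
    print(key, hex(base))
print(hex(string_addr), hex(offset_addr), hex(print_addr), hex(newline_addr))
```

Storing symbol addresses as segment-relative offsets is what makes this step cheap: only the four base addresses have to be recomputed.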

 
 
Loading
1. The linker assumes the text segment starts at address 0. If the program is indeed loaded (copied) into memory at address 0, it will run correctly.
2. Running more than one program at a time, however, requires loading all but one of them at an address that is not zero. Suppose the program is loaded at address 1000 instead of 0. This means that the segments' starting addresses are all off by 1000.
3. The loader will have to use the patch lists and symbol tables to patch the instructions once again so that they use the new starting addresses. The program will then run correctly when copied into memory at address 1000.
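
The loader's adjustment is a uniform shift, as this Python sketch shows. The link-time addresses are the ones from the linking example; the function name rebase is made up, and the sketch only re-patches one instruction (the jal print call) to show the effect.

```python
LOAD_BASE = 1000    # the load address used in the text


def rebase(symbols, load_base):
    """Shift every link-time address by the address the program is loaded at."""
    return {name: address + load_base for name, address in symbols.items()}


# Link-time addresses from the linking example (program assumed to start at 0).
linked = {"main": 0x0, "tolower": 0x34, "print": 0x70,
          "string": 0x9C, "offset": 0xA4, "newline": 0xA8}
loaded = rebase(linked, LOAD_BASE)

# Re-patch "jal print": a 26-bit word address goes into bits 0-25.
jal_print = 0x0C000000 | ((loaded["print"] >> 2) & 0x03FFFFFF)
print(loaded["print"], loaded["string"], hex(jal_print))
```

PC-relative branches need no such treatment, since both ends of the branch move by the same 1000 bytes.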
 
the end.
 