Chinaunix首页 | 论坛 | 博客
  • 博客访问: 1140354
  • 博文数量: 196
  • 博客积分: 4141
  • 博客等级: 中将
  • 技术积分: 2253
  • 用 户 组: 普通用户
  • 注册时间: 2009-03-21 20:04
文章存档

2019年(31)

2016年(1)

2014年(16)

2011年(8)

2010年(25)

2009年(115)

分类:

2009-03-22 11:49:42

Internals of the Netwide Assembler

==================================

 

The Netwide Assembler is intended to be a modular, re-usable x86

assembler, which can be embedded in other programs, for example as

the back end to a compiler.

 

The assembler is composed of modules. The interfaces between them

look like:

 

          +--- preproc.c ----+

          |          |       |

          +---- parser.c ----+

          |          |       |

          |     float.c      |

          |          |       |

          +--- assemble.c ---+

          |          |       |

    nasm.c ---+     insnsa.c     +--- nasmlib.c

          |          |       |

          +--- listing.c ----+

          |          |       |

          +---- labels.c ----+

          |          |       |

          +--- outform.c ----+

          |          |       |

          +----- *out.c -----+

 

In other words, each of `preproc.c', `parser.c', `assemble.c',

`labels.c', `listing.c', `outform.c' and each of the output format

modules `*out.c' are independent modules, which do not directly

inter-communicate except through the main program.

 

The Netwide *Disassembler* is not intended to be particularly

portable or reusable or anything, however. So I won't bother

documenting it here. :-)

 

nasmlib.c

---------

 

This is a library module; it contains simple library routines which

may be referenced by all other modules. Among these are a set of

wrappers around the standard `malloc' routines, which will report a

fatal error if they run out of memory, rather than returning NULL.

 

preproc.c

---------

 

This contains a macro preprocessor, which takes a file name as input

and returns a sequence of preprocessed source lines. The only symbol

exported from the module is `nasmpp', which is a data structure of

type `Preproc', declared in nasm.h. This structure contains pointers

to all the functions designed to be callable from outside the

module.

 

parser.c

--------

 

This contains a source-line parser. It parses `canonical' assembly

source lines, containing some combination of the `label', `opcode',

`operand' and `comment' fields: it does not process directives or

macros. It exports two functions: `parse_line' and `cleanup_insn'.

 

`parse_line' is the main parser function: you pass it a source line

in ASCII text form, and it returns you an `insn' structure

containing all the details of the instruction on that line. The

parameters it requires are:

 

- The location (segment, offset) where the instruction on this line

  will eventually be placed. This is necessary in order to evaluate

  expressions containing the Here token, `$'.

 

- A function which can be called to retrieve the value of any

  symbols the source line references.

 

- Which pass the assembler is on: an undefined symbol only causes an

  error condition on pass two.

 

- The source line to be parsed.

 

- A structure to fill with the results of the parse.

 

- A function which can be called to report errors.

 

Some instructions (DB, DW, DD for example) can require an arbitrary

amount of storage, and so some of the members of the resulting

`insn' structure will be dynamically allocated. The other function

exported by `parser.c' is `cleanup_insn', which can be called to

deallocate any dynamic storage associated with the results of a

parse.

 

names.c

-------

 

This doesn't count as a module - it defines a few arrays which are

shared between NASM and NDISASM, so it's a separate file which is

#included by both parser.c and disasm.c.

 

float.c

-------

 

This is essentially a library module: it exports one function,

`float_const', which converts an ASCII representation of a

floating-point number into an x86-compatible binary representation,

without using any built-in floating-point arithmetic (so it will run

on any platform, portably). It calls nothing, and is called only by

`parser.c'. Note that the function `float_const' must be passed an

error reporting routine.

 

assemble.c

----------

 

This module contains the code generator: it translates `insn'

structures as returned from the parser module into actual generated

code which can be placed in an output file. It exports two

functions, `assemble' and `insn_size'.

 

`insn_size' is designed to be called on pass one of assembly: it

takes an `insn' structure as input, and returns the amount of space

that would be taken up if the instruction described in the structure

were to be converted to real machine code. `insn_size' also requires

to be told the location (as a segment/offset pair) where the

instruction would be assembled, the mode of assembly (16/32 bit

default), and a function it can call to report errors.

 

`assemble' is designed to be called on pass two: it takes all the

parameters that `insn_size' does, but has an extra parameter which

is an output driver. `assemble' actually converts the input

instruction into machine code, and outputs the machine code by means

of calling the `output' function of the driver.

 

insnsa.c

--------

 

This is another library module: it exports one very big array of

instruction translations. It is generated automatically from the

insns.dat file by the insns.pl script.

 

labels.c

--------

 

This module contains a label manager. It exports six functions:

 

`init_labels' should be called before any other function in the

module. `cleanup_labels' may be called after all other use of the

module has finished, to deallocate storage.

 

`define_label' is called to define new labels: you pass it the name

of the label to be defined, and the (segment,offset) pair giving the

value of the label. It is also passed an error-reporting function,

and an output driver structure (so that it can call the output

driver's label-definition function). `define_label' mentally

prepends the name of the most recently defined non-local label to

any label beginning with a period.

 

`define_label_stub' is designed to be called in pass two, once all

the labels have already been defined: it does nothing except to

update the "most-recently-defined-non-local-label" status, so that

references to local labels in pass two will work correctly.

 

`declare_as_global' is used to declare that a label should be

global. It must be called _before_ the label in question is defined.

 

Finally, `lookup_label' attempts to translate a label name into a

(segment,offset) pair. It returns non-zero on success.

 

The label manager module is (theoretically :) restartable: after

calling `cleanup_labels', you can call `init_labels' again, and

start a new assembly with a new set of symbols.

 

listing.c

---------

 

This file contains the listing file generator. The interface to the

module is through the one symbol it exports, `nasmlist', which is a

structure containing six function pointers. The calling semantics of

these functions isn't terribly well thought out, as yet, but it

works (just about) so it's going to get left alone for now...

 

outform.c

---------

 

This small module contains a set of routines to manage a list of

output formats, and select one given a keyword. It contains three

small routines: `ofmt_register' which registers an output driver as

part of the managed list, `ofmt_list' which lists the available

drivers on stdout, and `ofmt_find' which tries to find the driver

corresponding to a given name.

 

The output modules

------------------

 

Each of the output modules, `outbin.o', `outelf.o' and so on,

exports only one symbol, which is an output driver data structure

containing pointers to all the functions needed to produce output

files of the appropriate type.

 

The exception to this is `outcoff.o', which exports _two_ output

driver structures, since COFF and Win32 object file formats are very

similar and most of the code is shared between them.

 

nasm.c

------

 

This is the main program: it calls all the functions in the above

modules, and puts them together to form a working assembler. We

hope. :-)

 

Segment Mechanism

-----------------

 

In NASM, the term `segment' is used to separate the different

sections/segments/groups of which an object file is composed.

Essentially, every address NASM is capable of understanding is

expressed as an offset from the beginning of some segment.

 

The defining property of a segment is that if two symbols are

declared in the same segment, then the distance between them is

fixed at assembly time. Hence every externally-declared variable

must be declared in its own segment, since none of the locations of

these are known, and so no distances may be computed at assembly

time.

 

The special segment value NO_SEG (-1) is used to denote an absolute

value, e.g. a constant whose value does not depend on relocation,

such as the _size_ of a data object.

 

Apart from NO_SEG, segment indices all have their least significant

bit clear, if they refer to actual in-memory segments. For each

segment of this type, there is an auxiliary segment value, defined

to be the same number but with the LSB set, which denotes the

segment-base value of that segment, for object formats which support

it (Microsoft .OBJ, for example).

 

Hence, if `textsym' is declared in a code segment with index 2, then

referencing `SEG textsym' would return zero offset from

segment-index 3. Or, in object formats which don't understand such

references, it would return an error instead.

 

The next twist is SEG_ABS. Some symbols may be declared with a

segment value of SEG_ABS plus a 16-bit constant: this indicates that

they are far-absolute symbols, such as the BIOS keyboard buffer

under MS-DOS, which always resides at 0040h:001Eh. Far-absolutes are

handled with care in the parser, since they are supposed to evaluate

simply to their offset part within expressions, but applying SEG to

one should yield its segment part. A far-absolute should never find

its way _out_ of the parser, unless it is enclosed in a WRT clause,

in which case Microsoft 16-bit object formats will want to know

about it.

 

Porting Issues

--------------

 

We have tried to write NASM in portable ANSI C: we do not assume

little-endianness or any hardware characteristics (in order that

NASM should work as a cross-assembler for x86 platforms, even when

run on other, stranger machines).

 

Assumptions we _have_ made are:

 

- We assume that `short' is at least 16 bits, and `long' at least

  32. This really _shouldn't_ be a problem, since Kernighan and

  Ritchie tell us we are entitled to do so.

 

- We rely on having more than 6 characters of significance on

  externally linked symbols in the NASM sources. This may get fixed

  at some point. We haven't yet come across a linker brain-dead

  enough to get it wrong anyway.

 

- We assume that `fopen' using the mode "wb" can be used to write

  binary data files. This may be wrong on systems like VMS, with a

  strange file system. Though why you'd want to run NASM on VMS is

  beyond me anyway.

 

That's it. Subject to those caveats, NASM should be completely

portable. If not, we _really_ want to know about it.

 

Porting Non-Issues

------------------

 

The following is _not_ a portability problem, although it looks like

one.

 

- When compiling with some versions of DJGPP, you may get errors

  such as `warning: ANSI C forbids braced-groups within

  expressions'. This isn't NASM's fault - the problem seems to be

  that DJGPP's definitions of the macros include a

  GNU-specific C extension. So when compiling using -ansi and

  -pedantic, DJGPP complains about its own header files. It isn't a

  problem anyway, since it still generates correct code.

 

Internals of the Netwide Assembler

==================================

 

The Netwide Assembler is intended to be a modular, re-usable x86

assembler, which can be embedded in other programs, for example as

the back end to a compiler.

 

The assembler is composed of modules. The interfaces between them

look like:

 

          +--- preproc.c ----+

          |          |       |

          +---- parser.c ----+

          |          |       |

          |     float.c      |

          |          |       |

          +--- assemble.c ---+

          |          |       |

    nasm.c ---+     insnsa.c     +--- nasmlib.c

          |          |       |

          +--- listing.c ----+

          |          |       |

          +---- labels.c ----+

          |          |       |

          +--- outform.c ----+

          |          |       |

          +----- *out.c -----+

 

In other words, each of `preproc.c', `parser.c', `assemble.c',

`labels.c', `listing.c', `outform.c' and each of the output format

modules `*out.c' are independent modules, which do not directly

inter-communicate except through the main program.

 

The Netwide *Disassembler* is not intended to be particularly

portable or reusable or anything, however. So I won't bother

documenting it here. :-)

 

nasmlib.c

---------

 

This is a library module; it contains simple library routines which

may be referenced by all other modules. Among these are a set of

wrappers around the standard `malloc' routines, which will report a

fatal error if they run out of memory, rather than returning NULL.

 

preproc.c

---------

 

This contains a macro preprocessor, which takes a file name as input

and returns a sequence of preprocessed source lines. The only symbol

exported from the module is `nasmpp', which is a data structure of

type `Preproc', declared in nasm.h. This structure contains pointers

to all the functions designed to be callable from outside the

module.

 

parser.c

--------

 

This contains a source-line parser. It parses `canonical' assembly

source lines, containing some combination of the `label', `opcode',

`operand' and `comment' fields: it does not process directives or

macros. It exports two functions: `parse_line' and `cleanup_insn'.

 

`parse_line' is the main parser function: you pass it a source line

in ASCII text form, and it returns you an `insn' structure

containing all the details of the instruction on that line. The

parameters it requires are:

 

- The location (segment, offset) where the instruction on this line

  will eventually be placed. This is necessary in order to evaluate

  expressions containing the Here token, `$'.

 

- A function which can be called to retrieve the value of any

  symbols the source line references.

 

- Which pass the assembler is on: an undefined symbol only causes an

  error condition on pass two.

 

- The source line to be parsed.

 

- A structure to fill with the results of the parse.

 

- A function which can be called to report errors.

 

Some instructions (DB, DW, DD for example) can require an arbitrary

amount of storage, and so some of the members of the resulting

`insn' structure will be dynamically allocated. The other function

exported by `parser.c' is `cleanup_insn', which can be called to

deallocate any dynamic storage associated with the results of a

parse.

 

names.c

-------

 

This doesn't count as a module - it defines a few arrays which are

shared between NASM and NDISASM, so it's a separate file which is

#included by both parser.c and disasm.c.

 

float.c

-------

 

This is essentially a library module: it exports one function,

`float_const', which converts an ASCII representation of a

floating-point number into an x86-compatible binary representation,

without using any built-in floating-point arithmetic (so it will run

on any platform, portably). It calls nothing, and is called only by

`parser.c'. Note that the function `float_const' must be passed an

error reporting routine.

 

assemble.c

----------

 

This module contains the code generator: it translates `insn'

structures as returned from the parser module into actual generated

code which can be placed in an output file. It exports two

functions, `assemble' and `insn_size'.

 

`insn_size' is designed to be called on pass one of assembly: it

takes an `insn' structure as input, and returns the amount of space

that would be taken up if the instruction described in the structure

were to be converted to real machine code. `insn_size' also requires

to be told the location (as a segment/offset pair) where the

instruction would be assembled, the mode of assembly (16/32 bit

default), and a function it can call to report errors.

 

`assemble' is designed to be called on pass two: it takes all the

parameters that `insn_size' does, but has an extra parameter which

is an output driver. `assemble' actually converts the input

instruction into machine code, and outputs the machine code by means

of calling the `output' function of the driver.

 

insnsa.c

--------

 

This is another library module: it exports one very big array of

instruction translations. It is generated automatically from the

insns.dat file by the insns.pl script.

 

labels.c

--------

 

This module contains a label manager. It exports six functions:

 

`init_labels' should be called before any other function in the

module. `cleanup_labels' may be called after all other use of the

module has finished, to deallocate storage.

 

`define_label' is called to define new labels: you pass it the name

of the label to be defined, and the (segment,offset) pair giving the

value of the label. It is also passed an error-reporting function,

and an output driver structure (so that it can call the output

driver's label-definition function). `define_label' mentally

prepends the name of the most recently defined non-local label to

any label beginning with a period.

 

`define_label_stub' is designed to be called in pass two, once all

the labels have already been defined: it does nothing except to

update the "most-recently-defined-non-local-label" status, so that

references to local labels in pass two will work correctly.

 

`declare_as_global' is used to declare that a label should be

global. It must be called _before_ the label in question is defined.

 

Finally, `lookup_label' attempts to translate a label name into a

(segment,offset) pair. It returns non-zero on success.

 

The label manager module is (theoretically :) restartable: after

calling `cleanup_labels', you can call `init_labels' again, and

start a new assembly with a new set of symbols.

 

listing.c

---------

 

This file contains the listing file generator. The interface to the

module is through the one symbol it exports, `nasmlist', which is a

structure containing six function pointers. The calling semantics of

these functions isn't terribly well thought out, as yet, but it

works (just about) so it's going to get left alone for now...

 

outform.c

---------

 

This small module contains a set of routines to manage a list of

output formats, and select one given a keyword. It contains three

small routines: `ofmt_register' which registers an output driver as

part of the managed list, `ofmt_list' which lists the available

drivers on stdout, and `ofmt_find' which tries to find the driver

corresponding to a given name.

 

The output modules

------------------

 

Each of the output modules, `outbin.o', `outelf.o' and so on,

exports only one symbol, which is an output driver data structure

containing pointers to all the functions needed to produce output

files of the appropriate type.

 

The exception to this is `outcoff.o', which exports _two_ output

driver structures, since COFF and Win32 object file formats are very

similar and most of the code is shared between them.

 

nasm.c

------

 

This is the main program: it calls all the functions in the above

modules, and puts them together to form a working assembler. We

hope. :-)

 

Segment Mechanism

-----------------

 

In NASM, the term `segment' is used to separate the different

sections/segments/groups of which an object file is composed.

Essentially, every address NASM is capable of understanding is

expressed as an offset from the beginning of some segment.

 

The defining property of a segment is that if two symbols are

declared in the same segment, then the distance between them is

fixed at assembly time. Hence every externally-declared variable

must be declared in its own segment, since none of the locations of

these are known, and so no distances may be computed at assembly

time.

 

The special segment value NO_SEG (-1) is used to denote an absolute

value, e.g. a constant whose value does not depend on relocation,

such as the _size_ of a data object.

 

Apart from NO_SEG, segment indices all have their least significant

bit clear, if they refer to actual in-memory segments. For each

segment of this type, there is an auxiliary segment value, defined

to be the same number but with the LSB set, which denotes the

segment-base value of that segment, for object formats which support

it (Microsoft .OBJ, for example).

 

Hence, if `textsym' is declared in a code segment with index 2, then

referencing `SEG textsym' would return zero offset from

segment-index 3. Or, in object formats which don't understand such

references, it would return an error instead.

 

The next twist is SEG_ABS. Some symbols may be declared with a

segment value of SEG_ABS plus a 16-bit constant: this indicates that

they are far-absolute symbols, such as the BIOS keyboard buffer

under MS-DOS, which always resides at 0040h:001Eh. Far-absolutes are

handled with care in the parser, since they are supposed to evaluate

simply to their offset part within expressions, but applying SEG to

one should yield its segment part. A far-absolute should never find

its way _out_ of the parser, unless it is enclosed in a WRT clause,

in which case Microsoft 16-bit object formats will want to know

about it.

 

Porting Issues

--------------

 

We have tried to write NASM in portable ANSI C: we do not assume

little-endianness or any hardware characteristics (in order that

NASM should work as a cross-assembler for x86 platforms, even when

run on other, stranger machines).

 

Assumptions we _have_ made are:

 

- We assume that `short' is at least 16 bits, and `long' at least

  32. This really _shouldn't_ be a problem, since Kernighan and

  Ritchie tell us we are entitled to do so.

 

- We rely on having more than 6 characters of significance on

  externally linked symbols in the NASM sources. This may get fixed

  at some point. We haven't yet come across a linker brain-dead

  enough to get it wrong anyway.

 

- We assume that `fopen' using the mode "wb" can be used to write

  binary data files. This may be wrong on systems like VMS, with a

  strange file system. Though why you'd want to run NASM on VMS is

  beyond me anyway.

 

That's it. Subject to those caveats, NASM should be completely

portable. If not, we _really_ want to know about it.

 

Porting Non-Issues

------------------

 

The following is _not_ a portability problem, although it looks like

one.

 

- When compiling with some versions of DJGPP, you may get errors

  such as `warning: ANSI C forbids braced-groups within

  expressions'. This isn't NASM's fault - the problem seems to be

  that DJGPP's definitions of the macros include a

  GNU-specific C extension. So when compiling using -ansi and

  -pedantic, DJGPP complains about its own header files. It isn't a

  problem anyway, since it still generates correct code.

 

阅读(1636) | 评论(2) | 转发(0) |
给主人留下些什么吧!~~

micklongen2010-02-08 23:14:55

呵呵。这篇是直接拷贝nasm源码里面的一个说明文件的,不是我写的。 至于这篇的中文,貌似网上有人翻译过,以前好像有看过。 关于标题和内容是不是语种要一致,这个我倒没有考虑过。

chinaunix网友2010-02-08 18:43:06

有没有中文的, 既然内容是英文的。那标题为什么不也用英文。 就算标题不用英文,标题也该说明是英文的。