分类:
2011-08-04 20:02:26
V0.0 20/12/01 First Draft
V0.1 | 11/01/02 | Minor corrections & cosmetics. |
V0.2 | 28/01/02 | Added observations about inodes, file headers and hard links. |
V0.3 | 26/02/02 | W:Added some general observations on compatibility, partitions and bootloading. |
The purpose of this document is to outline a potential NAND-friendly file system for Linux.
BackgroundThere are already various flash-file systems (FFSs) or block drivers for flash (on top of which a regular FS runs). There are pros and cons with all of these.
Flash memory has quite a few constraints which will not be addressed here. Various approaches are available to work around these constraints to provide a file system. It is important to recognise that "flash" includes both NOR and NAND flash which have different sets of constraints. It is easy to be mislead by the generic term "flash" into thinking that approaches appropriate for NOR flash are immediately suitable for NAND flash.
The NAND block drivers (eg. SmartMedia [currently not available for Linux] and DiskOnChip NFTL) typically use FAT16 as the file system. This isn't too robust and nor is it that flash-friendly. These block drivers provide a logical to physical mapping layer to emulate rewritable blocks that look like disk sectors. When used with FAT16, these file systems work reasonably well. They have a low memory footprint and scale well. Like all FAT based systems they are prone to corruption ("lost clusters etc").
The other approach is to design an entire file system which does not work through a block driver layer and is flash-friendly. This allows more scope to work around issues.
Currently, two Linux file systems that support NOR flash very well are JFFS and its derivative JFFS2. Both of these are based on the principles of journaling (hence the leading J) which significantly increases robustness - a particularly important feature in embedded systems. Unfortunately neither of these file systems scale particularly well in both boot time and RAM usage. Scaling is particularly relevant when one considers that a 16MB NOR array would be considered large while a 128MB NAND is available as a single chip.
JFFS requires a RAM-based jffs_node structure for each journalling node in the flash. Each of these nodes is 48 bytes. JFFS2 makes a significant improvement here by reducing the equivalent structure (jffs2_raw_node_ref) to 16 bytes. Still, at say an average node-size of 512 bytes, a 128MB NAND might need 250000 of these ... 4MB!
Both JFFS and JFFS2 require scanning the flash array at boot time to find the journaling nodes and determine the file structures. Since NAND is large, slow, serially accessed and needs ECC this does not scale well and will take an unacceptably long boot time for the target systems. As a thumb-suck, the scanning of a 128MB NAND array might take approx 25 seconds.
The intentions of the design sketched here are:
Be NAND-flash friendly.
Robustness through journaling strategies.
Significantly reduce the RAM overheads and boot times associated with JFFSx.
This FS is intended primarily for internal NAND rather than removable NAND (SM cards). On removable SM cards Smartmedia compatibility is likely to be important so SM/FAT will normally be used, although of course YAFFS makes a lot of sense if reliability is more important than compatibility.
OverviewHere follows a simplified overview of YAFFS.
YAFFS uses a physical flash format similar to SmartMedia. This is done for various reasons:
Some of the formatting, eg placement of bad block markers is determined by the NAND manufacturers and can't be changed.
Potential to reuse code.
If it ain't broke don't fix.
Some of the fields are obviously different to reflect the different usage. Despite the similarities YAFFS is not actually compatible with SM/FAT. SM cards need to be reformatted to switch from using SM/FAT to YAFFS or vice versa.
File data is stored in fixed size "chunks" consistent with the size of a page (ie. 512 bytes). Each page is marked with a file id and chunk number. These tags are stored in the "spare data" region of the flash. The chunk number is determined by dividing the file position by the chunk size.
When data in a file is overwritten, the relevant chunks are replaced by writing new pages to flash containing the new data but the same tags. The overwritten data is marked as "discarded".
File "headers" are stored as a single page, marked so as to be differentiated from data pages.
Pages are also marked with a short (2 bit) serial number that increments each time the page at this position is incremented. The reason for this is that if power loss/crash/other act of demonic forces happens before the replaced page is marked as discarded, it is possible to have two pages with the same tags. The serial number is used to arbitrate.
A block containing only discarded pages (termed a dirty block) is an obvious candidate for garbage collection. Otherwise valid pages can be copied off a block thus rendering the whole block discarded and ready for garbage collection.
In theory you don't need to hold the file structure in RAM... you could just scan the whole flash looking for pages when you need them. In practice though you'd want better file access times than that! The mechanism proposed here is to have a list of __u16 page addresses associated with each file. Since there are 218 pages in a 128MB NAND, a __u16 is insufficient to uniquely identify a page but is does identify a group of 4 pages - a small enough region to search exhaustively. This mechanism is clearly expandable to larger NAND devices - within reason. The RAM overhead with this approach is approx 2 bytes per page - 512kB of RAM for a whole 128MB NAND.
Boot-time scanning to build the file structure lists should require just one pass reading NAND. Since only the the spare data needs to be read, this should be relatively fast ( approx 3 seconds for 128MB). Improvements can be achieved by partitioning the NAND. ie. mount the critical partition first then mount the data partition afterwards.
Various runtime improvements can be achieved by changing the "chunk size" to 1024 bytes or more. However this would likely reduce flash efficiency. As always, life is a compromise....
The following table summarizes the layout of the spare area of each page.
Byte #
SmartMedia usage
YAFFS usage
0..511 | Data | Data. either file data or file header depending on tags |
512..515 | Reserved | Tags |
516 | Data status byte. Not used in SM code from Samsung | Data status byte. If more than 4 bits are zero, then this page is discarded. |
517 | Block status byte | Block status byte |
518..519 | Block address | Tags |
520..522 | ECC on second 256 bytes part of data | ECC on second 256 bytes of data |
523..524 | Block address | Tags |
525..527 | ECC on first 256 bytes part of data | ECC on first 256 bytes part of data |
The block status is a reserved value that shows whether the block is damaged.
The data status tracks whether the page is valid. If less than 4 bits are zero, then the page is valid otherwise it is discarded.
There are 8 bytes (64 bits) for use by YAFFS tags. This is partitioned as follows:
Number of bits
Usage
18 | 18-bit file id. ie. Limit of 218 (over 260000) files. File id 0 is not valid and indicates a deleted page. File Id 0x3FFFF i is also not valid. |
2 | 2-bit serial number. |
20 | 20-bit page id within file. Limit of 220 pages per file. ie. over 500MB file max size. Page id 0 means the file header for this file. |
10 | 10-bit counter of the number of bytes used in the page. |
12 | 12-bit ECC on tags. |
2 | Unused. Keep as 1. |
64 | Total |
A bit of explanation on the usage of some of these fields:
file Id is synonymous with inode.
The serial number is incremented each time a page with the same file_id:page_id is rewritten (because of data changes or copy during garbage collection). When a page is replaced, there is a brief period during which there are two pages with the same id. The serial number resolves this. Since there should never be a difference of less than more than one, a two-bit counter is sufficient to determine which is the current page.
When the page is rewritten, the file id, these data status byte and the 12-bit ECC are all written to zero.
The byte counter indicates how many bytes are valid in this page. Since the page would not exist if it contains zero bytes, this field should thus hold 512 for all pages except the last page in the file. The use of counters means that the file length integrity is preserved while the file is open without having to constantly update the file length in the file header. The file header only needs to be refreshed when the file is closed (rather than whenever it is appended to). This field is wide enough to allow expansion to 1024-byte "chunks".
File "headers" come in two flavours:
file info ( the mode, ower id, group id, length,...)
the hard link(s) that refers to the file.
A directory also appears as a file (ie. has an inode and hard link(s)) but has no data.
The 12-bit ECC applies to only the tag data uses a similar algorithm to the 22-bit ECCs used for file system data. They are kept independent.
RAM data detailsBlock management details are reasonably obvious and, I feel, don't need to be addressed here apart from stating that there will be a structure to track block status (eg. number of pages in use, failed blocks, and which are candidates for garbage collection etc).
The files need an indexing structure of sorts to locate and track the pages in the file. Some sort of tree structure should work rather well. The look-up needs to be quite efficient in both CPU time and space.
Page allocation and garbage collectionPages are allocated sequentially from the currently selected block. When all the pages in the block are filled, another clean block is selected for allocation. At least two or three clean blocks are reserved for garbage collection purposes. If there are insufficient clean blocks available, then a dirty block ( ie one containing only discarded pages) is erased to free it up as a clean block. If no dirty blocks are available, then the dirtiest block is selected for garbage collection.
Garbage collection is performed by copying the valid data pages into new data pages thus rendering all the pages in this block dirty and freeing it up for erasure. I also like the idea of selecting a block at random some small percentage of the time - thus reducing the chance of wear differences.
Relative to NOR, NAND writes and erases very fast. Therefore garbage collection might be performed on-demand (eg. during a write) without significant degradation in performance. Alternatively garbage collection might be delegated to a kernel tasklet.
Relative to JFFSx, the garbage collection presented here is incredibly simple - thanks mainly to the use of fixed-size pages instead of journaling nodes.
Flash writingAs presented here, YAFFS only writes to the page's data area once and to the spare area twice (once when new page is written and once when it gets stomped on) before an erasure. This is within the limits of the most restrictive NAND flashes.
Wear levellingNo wear levelling is explicitly used here. Instead we rely on two "strategies":
Reserving some blocks to cater for failure. You need to do this anyway with NAND. The main purpose behind wear levelling is to prevent some blocks getting more wear and failing. Since we expect, and handle, failure this is no longer as important.
The infrequent random block selection should prevent low-wear blocks getting "stuck".
Partitioning is not included in this spec, but could be added if required.
BootloadingBootloaders cannot just read files direct from NAND due to the high probability of bad blocks. Because YAFFS is quite simple it will be relatively straightforward for bootloaders to read from it (eg reading a kernel).
ConclusionYAFFS is very simple. It is also NAND-friendly, is relatively frugal with resources (especially RAM) and boots quickly. Like JFFSx it has journaling which makes it far more robust than FAT.
While it might seem high-risk to develop YAFFS, it is probably about the same amount of effort as implementing changes to JFFS to get it to work effectively within the constraints of NAND. A resulting JFFSx system would still require significant amounts of RAM and have long boot times.
While YAFFS is indeed a new file system internally, much of the file system glue code (including inode management, link management etc) can likely be stolen from JFFSx.