曾就职于阿里巴巴担任Oracle DBA,MySQL DBA,目前在新美大担任SRE。[是普罗米修斯还是一块石头,你自己选择!] 欢迎关注微信公众号 “自己的设计师”,不定期有原创运维文章推送。
分类: Mysql/postgreSQL
2013-12-17 11:58:42
Innodb’s data storage model uses “spaces”, often called “tablespaces” in the context of mysql, and sometimes called “file spaces” in innodb itself. A space may consist of multiple actual files at the operating system level (e.g. Ibdata1, ibdata2, etc.) But it is just a single logical file — multiple physical files are just treated as though they were physically concatenated together.
Innodb的数据存储模型通常使用“space”来标识,在mysql上下文中通常称作”tablespace”,在innodb内部有时候也被称作”file space”。在操作系统层面,一个表空间或许有多个实际的物理文件组成,但在逻辑上他们被看做一个统一的整体 —多个物理文件好像物理的连接在一起从而形成一个文件。
Each space in innodb is assigned a 32-bit integer space ID, which is used in many different places to refer to the space. Innodb always has a “system space”, which is always assigned the space ID of 0. The system space is used for various special bookkeeping that innodb requires. Through mysql, innodb currently only supports additional spaces in the form of “file per table” spaces, which create an .ibd file for each mysql table. Internally, this .ibd file is actually a fully functional space which could contain multiple tables, but in the implementation with mysql, they will only contain a single table.
Innodb会为每个表空间分配一个32-bit的整型的space ID,它被用作很多不同的地方来指向同一个space。Innodb总是会有一个”system space(系统表空间)”,该表空间也总是会用space ID为0来标识。系统表空间会存储各种innodb所需要的一些特殊记录。在mysql中,除了系统表空间外,innodb目前只支持以”file per table(innodb_file_per_table=on)”形式的额外的表空间,这种新式的表空间会为每一个mysql的表创建一个.ibd文件。在内部,这个.ibd文件实际上一个功能齐全的,可以包含多个表的文件,但是在mysql的实现上,它只包含一个表。
PagesEach space is divided into pages, normally 16 kib each (this can differ for two reasons: if the compile-time define UNIV_PAGE_SIZE is changed, or if innodb compression is used). Each page within a space is assigned a 32-bit integer page number, often called “offset”, which is actually just the page’s offset from the beginning of the space (not necessarily the file, for multi-file spaces). So, page 0 is located at file offset 0, page 1 at file offset 16384, and so on. (The astute may remember that innodb has a limit of 64tib of data; this is actually a limit per space, and is due primarily to the page number being a 32-bit integer combined with the default page size: 232 x 16 kib = 64 tib.)
每一个空间被划分成很多的页,每个页的大小为16KB (有两种原因可能导致这个大小不是16KB:[1] 在编译的时候UNIV_PAGE_SIZE定义被改变. [2] innodb使用了压缩的格式存储.).每一个page被分配一个32位整型的page number,通常被称作”偏移”,page number实际只是标识这个页在同一个space id标识的表空间中的起始位置的偏移量.(不一定是一个文件,也有可能是多个文件)。简而言之,page 0标识的是偏移量为0的页,page 1标识偏移量为 1*16k=1*16384的页,等等.如果细心回忆,你可能还记得,Innodb对每一个表空间的数据总量限制为64TB,这主要是由于page number是一个32位的整型,所以它所能标识最大的表空间的总大小为: 232 x 16 kib = 64 tib(在使用默认页为16KB的情况下)。
A page is laid out as follows:
一个页的基本结构如下:
Every page has a 38-byte FIL header and 8-byte FIL trailer (FIL is a shortened form of “file”). The header contains a field which is used to indicate the page type, which determines the structure of the rest of the page. The structure of the FIL header and trailer are:
每一个页包含一个38字节的文件头和一个8字节的文件尾。文件头包含一个被用于表示个文件类型的区域,该类型决定了这个页剩下空间的结构(除了头和尾以外剩下的空间)。文件头和文件尾的结构如下所示:
The FIL header and trailer contain the following structures (not in order):
文件头和文件尾包含如下的结构(不是按照存储的顺序描述):
- The page type is stored in the header. This is necessary in order to parse the rest of the page data. Pages are allocated for file space management, extent management, the transaction system, the data dictionary, undo logs, blobs, and of course indexes (table data).
Page type存储在文件头。文件类型的作用主要作用是为了解析page data部分。Page的分配主要用途包括file space管理,extent管理,事物系统,数据字典,undo日志,blobs,当然还包括索引(表数据)。
- The space ID is stored in the header.
Space ID被存储在page头部
- The page number is stored in the header once the page has been initialized. Checking that the page number read from that field matches what it should be based on the offset into the file is helpful to indicate that reading is correct, and this field being initialized indicates that the page has been initialized.
一旦该页被初始化,page number便会存储在文件头。根据从该区域读出的偏移量,检查是否和文件的偏移量匹配,从而有助于检测该读取的数据是否正确,而且该区域初始化也说明了该page也被初始化了。
- A 32-bit checksum is stored in the header, and an older format (and broken) 32-bit checksum is stored in the trailer. The older checksum could be deprecated and that space reclaimed at some point.
一个32位的校验字符被存储page头部,旧格式的32位的校验位存储page尾部。旧格式的校验方式已经被废弃而且所占的空间在某些时候也可能被回收。
- Pointers to the logical previous and next page for this page type are stored in the header. This allows doubly-linked lists of pages to be built, and this is used for INDEX pages to link all pages at the same level, which allows for e.g. Full index scans to be efficient. Many page types do not use these fields.
Pointers 存储在page头部,用来存储该类型的page在逻辑上的前一个和后一个page的指针。这允许使用双向链表的方式来建立这些page,而且这种方式也被index page使用,来将所有在同一级的page连接起来,这样会使例如全索引扫描更加高效。许多类型的页不使用这些 区域。
- The 64-bit log sequence number (LSN) of the last modification of the page is stored in the header, and the low 32-bits of the same LSN are stored in the trailer.
64位的日志序列号存储在page的头部,用来表示最后一次该页修改的日志序列号,而且该日志序列号的低32位存储在该page的尾部。
- A 64-bit “flush LSN” field is stored in the header, which is actually only populated for a single page in the entire system, page 0 of space 0. This stores the highest LSN flushed to any page in the entire system (all spaces). This field is a great candidate for re-use in the rest of the space.
一个64位的”刷新日志序列号”区域存储在page的头部,该区域只有在整个表空间中page id为0且page number为0的page上才会被填充。该区域存储了所有表空间所有被刷新的page中的最大刷新日志序列号。该区域是对该表空间中重新使用的最好的候补空间(因为除了page number为0的page使 用外,其他page不使用)。Space files
A space file is just a concatenation of many (up to 232) pages. For more efficient management, pages are grouped into blocks of 1 mib (64 contiguous pages with the default page size of 16 kib), and called an “extent”. Many structures then refer only to extents to allocate pages within a space.Innodb needs to do some bookkeeping to keep track of all of the pages, extents, and the space itself, so a space file has some mandatory super-structure:
一个表空间文件由许多的page组合而成(最多为 232).为了更有效的管理,page被组织成一个1MB的block(64个连续的page,每个page使用默认的16KB的空间),被称作”extent”.许多结构只用extent来为表空间分配存储空间。
Innodb需要做一些记录来跟踪所有的page,extent和表空间自身,所以表空间文件有一些必须的超级结构。
The first page (page 0) in a space is always an FSP_HDR or “file space header” page. The FSP_HDR page contains (confusingly) an FSP header structure, which tracks things like the size of the space and lists of free, fragmented, and full extents. (A more detailed discussion of free space management is reserved for a future post.)
表空间的第一个page(page 0)总是一个FSP_HDR或者说是一个包含表空间元数据信息的page。 FSP_HDR page包含了(容易混淆)一个FSP的头结构,用来跟踪记录诸如表空间的大小,free lists的信息,页分裂的信息,和所有extent的信息。(更多关于空间空间的管理详细讨论在保留到以后的文章中会加以分析)。
An FSP_HDR page only has enough space internally to store bookkeeping information for 256 extents (or 16,384 pages, 256 mib), so additional space must be reserved every 16,384 pages for bookkeeping information in the form of an XDES page. The structure of XDES and FSP_HDR pages is identical, except that the FSP header structure is zeroed-out in XDES pages. These additional pages are allocated automatically as a space file grows.
一个FSP_HDR 页的内部只有足够的空间来记录256个extent的信息(或者16384个page,256MB),所以额外的XDES page空必须为每16384个页预留空间来记录相关的信息。,FSP头部结构是一个清零的XDES PAGE,除此之外XDES的结构和FSP_HDR的结构相同。额外的页会在表空间的文件增长的同时被自动的分配。
The third page in each space (page 2) will be an INODE page, which is used to store lists related to file segments (groupings of extents plus an array of singly-allocated “fragment” pages). Each INODE page can store 85 INODE entries, and each index requires two INODE entries. (A more detailed discussion of INODE entries and file segments is reserved for a future post.)
Alongside each FSP_HDR or XDES page will also be an IBUF_BITMAP page, which is used for bookkeeping information related to insert buffering, and is outside the scope of this post.
每个表空间的第三个page(page 2)将会是一个INODE page,它用于存储相关文件segment的list(由extents和一些单独分配的”片段”的集合组成)。每一个INNODE page可以保存85个INNODE条目,而且每个INDEX PAGE需要至少两个INNODE条目。(更多关于INNODE条目和文件 segments的讨论在以后发表的文章中会做详细的讨论。)
除了每个FSP_HDR或者XDES page,还包括一个IBUF_BITMAP page,被用于记录关于insert buffering的信息,这已经超出了本文要讨论的范围The system space
The system space (space 0) is special in innodb, and contains quite a few pages allocated at fixed page numbers to store a wide range of information critical to innodb’s operation. Since the system space is a space like any other, it has the required FSP_HDR, IBUF_BITMAP, and INODE pages allocated as its first three pages. After that, it is a bit special:
系统表空间(space id为0的空间)在Innodb中是一个特殊的表空间,而且包含一些固定page number的page来存储关于innodb操作至关重要的信息。因为系统包空间和其他表空间很像,它也包括FSP_HDR,IBUF_BITMAP,和INNODE pages三个page,这三个page都分配在该表
空间的前三个page中。从第三个页之后,它就有点特别了:
The following pages are allocated:
下面是系统表空间page的信息:
- Page 3, type SYS: Headers and bookkeeping information related to insert buffering.
Page 3,SYS类型:insert buffer的头部而且储存的是和insert buffering相关的信息。
- Page 4, type INDEX: The root page of the index structure used for insert buffering.
Page 4,INDX 类型:用于存放insert buffering信息的根索引节点。
- Page 5, type TRX_SYS: Information related to the operation of innodb’s transaction system, such as the latest transaction ID, mysql binary log information, and the location of the double write buffer extents.
Page 5,TRX_SYS类型,包括innodb事物系统操作信息,比如最后的事物ID,mysql二进制日志信息,双写缓存extents的位置信息等。
- Page 6, type SYS: The first rollback segment page. Additional pages (or whole extents) are allocated as needed to store rollback segment data.
Page 6, SYS类型:第一个回滚段的page。额外的page(或者说整个extents)在需要存放回滚段信息的时候分配。
- Page 7, type SYS: Headers related to the data dictionary, containing root page numbers for the indexes that make up the data dictionary. This information is required to be able to find any other indexes (tables), as their root page numbers are stored in the data dictionary itself.
Page 7,SYS类型:关于数据字典的头部信息,包含组成数据字典的索引的根节点编号(root page number).该信息存在的意义是以便能够找到其他任何索引(表),这些根节点page number存储在这个数据字典中。
- Pages 64-127: The first block of 64 pages (an extent) in the double write buffer. The double write buffer is used as part of innodb's recovery mechanism.
Page 64-127:双写缓存中第一个64 page组成的block(一个extent).双写缓存主要是用于innodb的恢复机制部分。
- Pages 128-191: The second block of the double write buffer.
Page 128-191:双写缓存中的第二个block。All other pages are allocated on an as-needed basis to indexes, rollback segments, undo logs, etc.
所有其他的页按需分配,包括index,回滚段,undo日志,等等..
Per-table space files
Innodb offers a "file per table" mode, which will create a file (which as explained above is actually a space) for each mysql table created. A better name for this feature may be "space per table" rather than "file per table". The .ibd file created for each table has the typical space file structure:
Innodb提供了一个”file per table”的模式,它可以为每一个mysql表创建一个文件(在上面的解释中实际被称作表空间).相比”file per table”而言,”space per table”或许是一个更好的名称.每一个表都会创建一个.ibd文件,该典型的表空间文件的结构如下:
Ignoring "fast index creation" which adds indexes at runtime, after the requisite 3 initial pages, the next pages allocated in the space will be the root pages of each index in the table, in the order they were defined in the table creation. Page 3 will be the root of the clustered index, Page 4 will be the root of the first secondary key, etc.
忽略”fast index creation”会在运行时加索引的情况,在必需的前三个初始化page之后的page分配的是索引组织表的根索引节点,顺序的记录了表创建定义的一些信息。 Page 3是索引组织表的根节点,Page 4将会是第一个二级索引的根节点,等等.Since most of innodb's bookkeeping structures are stored in the system space, most pages allocated in a per-table space will be of type INDEX and store table data.
因为大部分innodb的记录结构都存放在系统表空间,每个page被分配给每个表空间(非系统表空间)将会是一个INDEX类型,同时存储了表的数据。
What's next?
Next we'll look at free space management within innodb: extent descriptors, file segments (inodes), and lists.
下一节,我们探索innodb中关于空闲空间的管理:extent描述信息,file段(inodes),和链表
Ignoring "fast index creation" which adds indexes at runtime, after the requisite 3 initial pages, the next pages allocated in the space will be the root pages of each index in the table, in the order they were defined in the table creation. Page 3 will be the root of the clustered index, Page 4 will be the root of the first secondary key, etc.忽略”fast index creation”会在运行时加索引的情况,在必需的前三个初始化page之后的page,分配的是索引组织表的根索引节点,顺序的记录了表创建定义的一些信息。Page 3是索引组织表的根节点,Page 4将会是第一个二级索引的根节点,等等.
Since most of innodb's bookkeeping structures are stored in the system space, most pages allocated in a per-table space will be of type INDEX and store table data.因为大部分innodb的记录结构都存放在系统表空间,每个page被分配给每个表空间(非系统表空间)将会是一个INDEX类型,同时存储了表的数据。
What's next?
Next we'll look at free space management within innodb: extent descriptors, file segments (inodes), and lists.
下一节,我们探索innodb中关于空闲空间的管理:extent描述信息,file段(inodes),和链表
该文章翻译自google高级架构师Jeremy Cole
原文: http://blog.jcole.us/2013/01/03/the-basics-of-innodb-space-file-layout/
翻译当中可能有很多不准确的地方,希望看到的朋友能够指出来。