2008-10-13 16:11:12
This is documentation on the .chm format used by Microsoft HTML Help. This format has been reverse engineered in the past, but as far as I know this is the first freely available documentation on it. One Usenet message indicates that these .chm files are actually IStorage files documented in the Microsoft Platform SDK. However, I have not been able to locate such documentation.
The word "section" is badly overloaded in this document. Sorry about that.
All numbers are in hexadecimal unless otherwise indicated in the text. Except in tabular listings, this will be indicated by $ or 0x as appropriate. All values within the file are Intel byte order (little endian) unless indicated otherwise.
The .chm file begins with a short ($38 byte) initial header. This is followed by the header section table and the offset to the content. Collectively, this is the "header".
The header is followed by the header sections. There are two header sections. One header section is the file directory, the other contains the file length and some unknown data. Immediately following the header sections is the content.
The header starts with the initial header, which has the following format:

0000: char[4]  'ITSF'
0004: DWORD    3 (Version number)
0008: DWORD    Total header length, including header section table and following data.
000C: DWORD    1 (unknown)
0010: DWORD    a timestamp. Considered as a big-endian DWORD, it appears to contain
               seconds (MSB) and fractional seconds (second byte). The third and
               fourth bytes may contain even more fractional bits. The 4 least
               significant bits in the last byte are constant.
0014: DWORD    Windows Language ID. The two I've seen:
               $0409 = LANG_ENGLISH/SUBLANG_ENGLISH_US
               $0407 = LANG_GERMAN/SUBLANG_GERMAN
0018: GUID     {7C01FD10-7BAA-11D0-9E0C-00A0C922E6EC}
0028: GUID     {7C01FD11-7BAA-11D0-9E0C-00A0C922E6EC}
Note: a GUID is $10 bytes, arranged as 1 DWORD, 2 WORDs, and 8 BYTEs.
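A minimal Python sketch of reading the initial header, assuming the layout listed above. The function name and the dictionary keys are my own, not part of the format. The timestamp is returned as the raw little-endian DWORD, since its meaning is only partly understood.

```python
import struct
import uuid

# 'ITSF', then five DWORDs (version, header length, unknown, timestamp,
# language ID), then two 16-byte GUIDs: 0x38 bytes, little endian.
ITSF_HEADER = struct.Struct("<4s5I16s16s")


def parse_initial_header(data):
    """Decode the 0x38-byte initial header at the start of a .chm file."""
    if len(data) < ITSF_HEADER.size:
        raise ValueError("too short for an ITSF header")
    (signature, version, header_len, _unknown, timestamp,
     lang_id, guid1, guid2) = ITSF_HEADER.unpack_from(data, 0)
    if signature != b"ITSF":
        raise ValueError("not an ITSF file")
    return {
        "version": version,
        "header_length": header_len,
        "timestamp": timestamp,  # raw value, not interpreted here
        "language_id": lang_id,
        # bytes_le matches the 1 DWORD, 2 WORDs, 8 BYTEs GUID layout.
        "guid1": uuid.UUID(bytes_le=guid1),
        "guid2": uuid.UUID(bytes_le=guid2),
    }
```

Typical use would be to pass the first $38 bytes of the file and check the version before reading further.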
It is followed by the header section table, which is 2 entries, where each entry is $10 bytes long and has this format:

0000: QWORD    Offset of section from beginning of file
0008: QWORD    Length of section
Following the header section table is 8 bytes of additional header data. In Version 2 files, this data is not there and the content section starts immediately after the directory.
0000: QWORD Offset within file of content section 0
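As a rough sketch, the header section table and the trailing content offset could be read like this. It assumes the table starts at $38, directly after the initial header; the function name and return shape are illustrative only.

```python
import struct


def parse_header_section_table(data, version=3):
    """Return ([(offset, length), (offset, length)], content_offset).

    content_offset is None for Version 2 files, which lack the extra QWORD.
    """
    pos = 0x38  # the table follows the initial header
    sections = [struct.unpack_from("<QQ", data, pos + 0x10 * i)
                for i in range(2)]
    pos += 2 * 0x10
    content_offset = None
    if version >= 3:
        (content_offset,) = struct.unpack_from("<Q", data, pos)
    return sections, content_offset
```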
This header section contains the total size of the file, and not much else:

0000: DWORD    $01FE (unknown)
0004: DWORD    0 (unknown)
0008: QWORD    File Size
0010: DWORD    0 (unknown)
0014: DWORD    0 (unknown)
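A small sketch of pulling the file size out of this header section, which can serve as a sanity check against the real size on disk. The function name is hypothetical and the unknown fields are simply ignored.

```python
import struct


def parse_file_size_section(data, offset=0):
    """Return the File Size QWORD from the $18-byte header section."""
    _unk1, _unk2, file_size, _unk3, _unk4 = struct.unpack_from(
        "<IIQII", data, offset)
    return file_size
```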
The central part of the .chm file: A directory of the files and information it contains.
The directory starts with a header; its format is as follows:

0000: char[4]  'ITSP'
0004: DWORD    Version number 1
0008: DWORD    Length of the directory header
000C: DWORD    $0a (unknown)
0010: DWORD    $1000 Directory chunk size
0014: DWORD    "Density" of quickref section, usually 2.
0018: DWORD    Depth of the index tree: 1 if there is no index, 2 if there is
               one level of PMGI chunks.
001C: DWORD    Chunk number of root index chunk, -1 if there is none (though
               at least one file has 0 despite there being no index chunk,
               probably a bug.)
0020: DWORD    Chunk number of first PMGL (listing) chunk
0024: DWORD    Chunk number of last PMGL (listing) chunk
0028: DWORD    -1 (unknown)
002C: DWORD    Number of directory chunks (total)
0030: DWORD    Windows language ID
0034: GUID     {5D02926A-212E-11D0-9DF9-00A0C922E6EC}
0044: DWORD    $54 (This is the length again)
0048: DWORD    -1 (unknown)
004C: DWORD    -1 (unknown)
0050: DWORD    -1 (unknown)
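The directory header can be decoded in the same way. This is only a sketch: the field names are mine, the DWORDs are read as signed so that -1 comes out as -1, and chunk_offset assumes the chunks are packed back to back right after the header.

```python
import struct
from collections import namedtuple

# 'ITSP', twelve DWORDs, a GUID, four more DWORDs: $54 bytes in total.
ITSP_HEADER = struct.Struct("<4s12i16s4i")

DirectoryHeader = namedtuple(
    "DirectoryHeader",
    "version header_length chunk_size density depth root_index_chunk "
    "first_listing_chunk last_listing_chunk chunk_count language_id")


def parse_directory_header(data, offset=0):
    """Decode the ITSP header found at the start of the directory section."""
    f = ITSP_HEADER.unpack_from(data, offset)
    if f[0] != b"ITSP":
        raise ValueError("not an ITSP directory header")
    # f[3] ($0a) and f[10] (-1) are the unknown fields and are skipped.
    return DirectoryHeader(f[1], f[2], f[4], f[5], f[6], f[7],
                           f[8], f[9], f[11], f[12])


def chunk_offset(directory_offset, header, chunk_number):
    """File offset of a directory chunk, given the directory's own offset."""
    return (directory_offset + header.header_length
            + chunk_number * header.chunk_size)
```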
The header is directly followed by the directory chunks. There are two types of directory chunks -- index chunks, and listing chunks. The index chunk will be omitted if there is only one listing chunk. A listing chunk has the following format:

0000: char[4]  'PMGL'
0004: DWORD    Length of free space and/or quickref area at end of directory chunk
0008: DWORD    Always 0.
000C: DWORD    Chunk number of previous listing chunk when reading directory
               in sequence (-1 if this is the first listing chunk)
0010: DWORD    Chunk number of next listing chunk when reading directory in
               sequence (-1 if this is the last listing chunk)
0014:          Directory listing entries (to quickref area).
               Sorted by filename; the sort is case-insensitive.
The quickref area is written backwards from the end of the chunk. One quickref entry exists for every n entries in the file, where n is calculated as 1 + (1 << quickref density). So for density = 2, n = 5.
Chunklen-0002: WORD    Number of entries in the chunk
Chunklen-0004: WORD    Offset of entry n from entry 0
Chunklen-0008: WORD    Offset of entry 2n from entry 0
Chunklen-000C: WORD    Offset of entry 3n from entry 0
...
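To make the arithmetic concrete, here is a small sketch that computes n and lists which entries would have a quickref slot. It reads only the entry count from the last WORD of the chunk and deliberately does not read the offsets themselves, since the listing above steps by 4 bytes between WORD-sized values and I am not sure of the exact spacing. All function names are hypothetical.

```python
import struct


def quickref_interval(density):
    """Number of entries between quickref slots: n = 1 + (1 << density)."""
    return 1 + (1 << density)


def chunk_entry_count(chunk):
    """Entry count stored in the last WORD of a directory chunk."""
    (count,) = struct.unpack_from("<H", chunk, len(chunk) - 2)
    return count


def quickref_entry_numbers(density, count):
    """Entry numbers n, 2n, 3n, ... below count (assumed to be indexed)."""
    n = quickref_interval(density)
    return list(range(n, count, n))
```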
The format of a directory listing entry is as follows:

ENCINT: length of name
BYTEs:  name (UTF-8 encoded)
ENCINT: content section
ENCINT: offset
ENCINT: length
The offset is from the beginning of the content section the file is in, after the section has been decompressed (if appropriate). The length also refers to length of the file in the section after decompression.
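Putting the listing chunk header and the entry format together, a chunk could be walked roughly as follows. This is a sketch under the assumption that the entries run from $14 up to the start of the free space/quickref area given by the DWORD at 0004; the helper names are mine. The ENCINT decoder follows the rule described in this document (high bit means "continued", most significant group first).

```python
import struct


def _read_encint(data, pos):
    """Decode one ENCINT at data[pos]; return (value, next_pos)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


def parse_listing_chunk(chunk):
    """Return (prev_chunk, next_chunk, entries) for one PMGL chunk.

    Each entry is a (name, section, offset, length) tuple.
    """
    signature, free_len, _zero, prev_chunk, next_chunk = struct.unpack_from(
        "<4sIIii", chunk, 0)
    if signature != b"PMGL":
        raise ValueError("not a PMGL chunk")
    entries = []
    pos = 0x14
    end = len(chunk) - free_len  # entries stop where the free area begins
    while pos < end:
        name_len, pos = _read_encint(chunk, pos)
        name = chunk[pos:pos + name_len].decode("utf-8")
        pos += name_len
        section, pos = _read_encint(chunk, pos)
        offset, pos = _read_encint(chunk, pos)
        length, pos = _read_encint(chunk, pos)
        entries.append((name, section, offset, length))
    return prev_chunk, next_chunk, entries
```

Following next_chunk from the first listing chunk until it is -1 would enumerate the whole directory in sorted order.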
There are two kinds of file represented in the directory: user data and format related files. The files which are format-related have names which begin with '::', the user data files have names which begin with "/".
An index chunk has the following format:

0000: char[4]  'PMGI'
0004: DWORD    Length of quickref/free area at end of directory chunk
0008:          Directory index entries (to quickref/free area)

The quickref area in a PMGI is the same as in a PMGL.

The format of a directory index entry is as follows:

ENCINT: length of name
BYTEs:  name (UTF-8 encoded)
ENCINT: directory listing chunk which starts with name

When higher-level indexes exist (when the depth of the index tree is 3 or higher), presumably the upper-level indexes will contain the numbers of lower-level index chunks rather than listing chunks.
An ENCINT is a variable-length integer. The high bit of each byte indicates "continued to the next byte". Bytes are stored most significant to least significant. So, for example, $EA $15 is (((0xEA&0x7F)<<7)|0x15) = 0x3515.
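The rule above translates into a few lines of code. This sketch includes an encoder as well, which is my own addition for round-trip checking and assumes non-negative values.

```python
def read_encint(data, pos=0):
    """Decode one ENCINT starting at data[pos]; return (value, next_pos)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)  # 7 payload bits per byte
        if not byte & 0x80:                   # high bit clear: last byte
            return value, pos


def write_encint(value):
    """Encode a non-negative integer as an ENCINT (illustrative helper)."""
    groups = [value & 0x7F]                   # final byte, high bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)  # earlier bytes, high bit set
        value >>= 7
    return bytes(reversed(groups))
```

With these, read_encint(bytes([0xEA, 0x15])) yields the value 0x3515 and a next position of 2, matching the example in the text.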
In Version 3, the content typically immediately follows the header sections, and is at the location indicated by the QWORD following the header section table. In Version 2, the content immediately follows the header. All content section 0 locations in the directory are relative to that point. The other content sections are stored WITHIN content section 0.
There exists in content section 0 and in the directory a file called "::DataSpace/NameList". This file contains the names of all the content sections. The format is as follows:

0000: WORD     Length of file, in words
0002: WORD     Number of entries in file

Each entry:
0000: WORD     Length of name in words, excluding terminating null
0002: WORD     Double-byte characters
xxxx: WORD     0
Yes, the names have a length word AND are null terminated; sort of a belt-and-suspenders approach. The coding system is likely UTF-16 (little endian).
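A sketch of reading the NameList, assuming the names really are UTF-16 little endian as guessed above. The function name is hypothetical; it trusts the length word and simply skips the terminating null.

```python
import struct


def parse_name_list(data):
    """Return the list of content section names from ::DataSpace/NameList."""
    _file_words, count = struct.unpack_from("<HH", data, 0)
    pos = 4
    names = []
    for _ in range(count):
        (name_words,) = struct.unpack_from("<H", data, pos)
        pos += 2
        names.append(data[pos:pos + 2 * name_words].decode("utf-16-le"))
        pos += 2 * name_words + 2  # skip the name and its null WORD
    return names
```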
The section names seen so far are "Uncompressed" and "MSCompressed".
"Uncompressed" is self-explanatory. The section "MSCompressed" is compressed with Microsoft's LZX algorithm.
For each section other than 0, there exists a file called '::DataSpace/Storage/
There are several other files associated with the sections
The ControlData file contains $20 bytes of information on the compression. The information is partially known:

0000: DWORD    Number of DWORDs following 'LZXC', must be 6 if version is 2
0004: ASCII    'LZXC' Compression type identifier
0008: DWORD    Version (Must be <=2)
000C: DWORD    The LZX reset interval
0010: DWORD    The window size
0014: DWORD    The cache size
0018: DWORD    0 (unknown)
Reset interval, window size, and cache size are in bytes if version is 1, $8000-byte blocks if version is 2.
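A hedged sketch of decoding this information and normalising the three sizes to bytes, following the version rule just stated. The function name and dictionary keys are mine. The window_bits value is an extra I added: it is the base-2 exponent of the window size in bytes, on the assumption that the window size is a power of two and that an LZX decoder wants it in that form.

```python
import struct


def parse_control_data(data):
    """Decode the LZXC compression info and convert sizes to bytes."""
    (n_dwords, tag, version, reset_interval,
     window_size, cache_size) = struct.unpack_from("<I4s4I", data, 0)
    if tag != b"LZXC":
        raise ValueError("section is not LZX compressed")
    if version > 2:
        raise ValueError("unsupported LZXC version")
    if version == 2 and n_dwords != 6:
        raise ValueError("unexpected length word for version 2")
    # Version 1 stores bytes; version 2 stores $8000-byte blocks.
    scale = 0x8000 if version == 2 else 1
    window_bytes = window_size * scale
    return {
        "version": version,
        "reset_interval": reset_interval * scale,
        "window_size": window_bytes,
        "cache_size": cache_size * scale,
        # Assumes a power-of-two window, e.g. 0x10000 bytes gives 16.
        "window_bits": window_bytes.bit_length() - 1,
    }
```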
This file contains a quadword containing the uncompressed length of the section.
It appears this file was intended to contain a list of GUIDs belonging to methods of decompressing (or otherwise transforming) the section. However, it actually contains only half of the string representation of a GUID, apparently because it was sized for characters but contains wide characters.
The compressed sections are compressed using LZX, a compression method Microsoft also uses for its cabinet files. To confirm this, check the second DWORD of compression info in the ControlData file for the section; it should be 'LZXC'. To decompress, first read the file "::DataSpace/Storage/
0000: DWORD    2 unknown (possibly a version number)
0004: DWORD    Number of entries in reset table
0008: DWORD    8 Size of table entry (bytes)
000C: DWORD    $28 Length of table header (area before table entries)
0010: QWORD    Uncompressed Length
0018: QWORD    Compressed Length
0020: QWORD    0x8000 block size for locations below
0028: QWORD    0 (zeroth entry of table)
0030: QWORD    location in compressed data of 1st block boundary in uncompressed data
Repeat to end of file
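The table above could be read along these lines. This is a sketch with names of my own choosing; it uses the header length field to find the first entry rather than hard-coding $28.

```python
import struct


def parse_reset_table(data):
    """Decode the reset table header and its list of QWORD entries."""
    _version, count, entry_size, header_len = struct.unpack_from(
        "<4I", data, 0)
    uncompressed_len, compressed_len, block_size = struct.unpack_from(
        "<3Q", data, 0x10)
    if entry_size != 8:
        raise ValueError("unexpected reset table entry size")
    entries = list(struct.unpack_from("<%dQ" % count, data, header_len))
    return {
        "uncompressed_length": uncompressed_len,
        "compressed_length": compressed_len,
        "block_size": block_size,
        # entries[i] is the compressed offset of uncompressed block i.
        "entries": entries,
    }
```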
Now you can finally obtain the section (from its Content file). The window size for the LZX compression is 16 (decimal) on all the files seen so far. This is specified by the DWORD at $10 in the ControlData file (but note that DWORD gives the window size in 0x8000-byte blocks, not the LZX code for the window size).
The rule that the input bit-stream is to be re-aligned to a 16-bit boundary after $8000 output characters have been processed IS in effect, despite this LZX not being part of a CAB file. The reset table tells you when this was done, though there is no need for that during decompression; you can just keep track of the number of output characters. Furthermore, while this does not appear to be documented in the LZX format, the uncompressed stream is padded to an $8000 byte boundary.
There is one change from LZX as defined by Microsoft: After each LZX reset interval (defined in the ControlData file, but in practice equal to the window size) of compressed data is processed, the LZX state is fully reset, as if an entirely new file was being encoded. This allows semi-random access to the compressed data; you can start reading on any reset interval boundary using the reset interval size and the reset table.
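To illustrate the semi-random access, here is a sketch that finds where to start decoding for a given uncompressed offset. It rests on assumptions that are mine: the reset interval is expressed in bytes of uncompressed output and is a whole multiple of the block size, and the reset table holds one compressed offset per uncompressed block. The function name is hypothetical.

```python
def locate_reset_point(offset, reset_interval, block_size, table_entries):
    """Find where decoding must start to reach an uncompressed offset.

    Returns (compressed_start, bytes_to_skip): begin LZX decoding with fresh
    state at compressed_start, then discard bytes_to_skip bytes of output.
    """
    blocks_per_reset = reset_interval // block_size
    block = offset // block_size
    # Round down to the nearest block at which the LZX state was reset.
    start_block = block - block % blocks_per_reset
    return table_entries[start_block], offset - start_block * block_size
```

For example, with a $10000-byte reset interval and $8000-byte blocks, an offset of $1A345 falls in block 3, so decoding would start at the table entry for block 2 and skip $A345 bytes of output.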
Note:
Earlier versions of this document stated that the reset interval only reset the Huffman tables and required outputting the 1-bit header again. This was erroneous. The Lempel Ziv state is reset as well. In practice, a decoder works just as well with the incorrect assumption, but encoding a file with match positions which refer to a time before the most recent LZX reset causes crashes on decoding.
The following people (in no particular order) have submitted information which has helped correct and close the gaps in this document.
And others I have not been able to reach.
Copyright 2001-2003 Matthew T. Russotto
You may freely copy and distribute unmodified copies of this file, or copies where the only modification is a change in line endings, padding after the html end tag, coding system, or any combination thereof. The original is in ASCII with Unix line endings.
An incomplete description of Microsoft's .CHM format.
A description of Microsoft's ITOL/ITLS format, which is used by HTML Help 2.0 among other things.
A set of tools for working with the CHM files, consisting of a C language library 'chmlib' and a program called 'chmdump' which dumps out the files in a CHM file.
Not everything in the document is implemented here, but it is a start, and an LZX decompression engine (from Stuart Caie's "cabextract", suitably modified) is included. License is the GPL, following "cabextract".
I also have a C++ library for reading CHM and ITOL/ITLS formats, including the ability to use arbitrary transforms in the latter.
An LGPLed LZX compression engine, suitable for creating compressed CHM files. Or for use in a CAB-making utility or for any other purpose LZX is useful for.
Documentation for the lzxcomp library is included.
Changed May 3 2100 EST: fixed a really dumb bug introduced last minute.
Also allowed LZ compressor to look into the match buffer, for a significant compression improvement.
I'm not a DD, or even in the NM queue yet. I have filed a few ITPs, found a sponsor for some Debian packages, and experienced the melting of NEW and the "flood" of FTBFS/etc bug reports. I plan to ITP some more packages, a couple of fonts, some of my software and some orphaned packages at some point, and enter the NM queue after a while of having sponsored packages in Debian. I'm thinking that I will get more involved in QA work during that period and start sponsoring new maintainers once I become a DD.
I also maintain debian packages of some of the software I have written, which can be found on (this means you must build it yourself).
I also maintain win32 packages of some of the software I have written. These packages will be uploaded to their respective download pages.
CHMLIB is a library for dealing with Microsoft ITSS/CHM format files. Right now, it is a very simple library, but sufficient for dealing with all of the .chm files I've come across. Due to the fairly well-designed indexing built into this particular file format, even a small library is able to gain reasonably good performance indexing into ITSS archives.
Version 0.37 is primarily a security release. On October 25th, a security vulnerability was located by Sven Tantau. This release is primarily to fix this, as well as a broken Makefile.in which didn't properly install the library for people who did:
./configure; make; make install
If you did this, and were unable to subsequently build the example programs, this release should fix it for you. 0.37.2 includes yet another small patch to the Makefile.in. The change in 0.37.2 will be mainly of importance to packagers who use:
make install DESTDIR=/path/to/sandbox
as DESTDIR had been inadvertently omitted from one of the actions in the "make install" target.
In the continuing Makefile.in saga, 0.37.3 contains yet one more minor patch to make DESTDIR work properly. The symlinks were being created pointing to $(DESTDIR)$(libdir)libchm.so.0.0.0. When DESTDIR was set to a temporary build location for packaging, this meant that the symlinks were broken. Thanks to Mark Rosenstand for pointing this out and supplying a patch.
Once more with feeling! 0.37.4 contains yet another fix to the Makefile.in, from Thomas Klausner. 'make install' was not using libtool to install the shared library, which is a portability issue. (For anyone who has had difficulty with 'make install' on non-Linux platforms, this may be the cause.) Furthermore, exec_prefix was not being set, so the library itself was being installed in /lib, regardless of the chosen installation prefix.
UTF-8 support is fairly minimal at present. By this, I mean that I return the filename verbatim. Filename comparisons are done using strcasecmp, which is clearly not correct for UTF-8. I'm very interested in hearing from anyone who has dealt with internationalized filenames before, and can tell me the "right" way to deal with them. (Hopefully in a portable way.)
I've set up a sourceforge project to host this library, but I haven't really had time to move the project over. Maybe someday...
To do:
Right now this library supports enumerating the contents of the archive, and reading files from the archive.
This code is now being distributed under the LGPL. It incorporates LZX decompression code from the cabextract project. Thanks to Stuart Caie for authorizing the relicensing of this code in the context of chmlib.
Thanks to Stan Tobias for bugfixes and Andrew Hodgetts for bugfixes and portability fixes!
For those interested in the CHM format, a good resource is , which is a free software project for creating HTML Help files. More importantly, the author maintains a reverse engineered spec for HTML Help files, including the structure of the internal files, which maintain the "topic" structure of the help file, the full-text index, and other useful things. At the time of writing, the spec was not available for download; however, the author has plans to publish it on his site when it is more complete, and an offer to mail out the current version to anyone who expresses interest.
Another "free software" tool which fulfills approximately the same niche as this library may be downloaded from . If, for some reason, my library does not meet your needs, try out the chmtools from this site. Apparently, this site also offers LZX compression code.
Download version 0.37.4:
Download version 0.37:
Download version 0.36:
Download version 0.35:
Download version 0.33:
Download version 0.32:
Download version 0.31:
Download version 0.3:
Download version 0.2:
Download version 0.1:
Applications which use chmlib:
Language bindings for chmlib: