Category: WINDOWS
2011-03-10 12:27:53
If the BOM character appears in the middle of a data stream, it should, according to Unicode, be interpreted as a "zero-width non-breaking space" (essentially a null character). Its deliberate use for this purpose has been deprecated since Unicode 3.2, however, with the "Word Joiner" character, U+2060, strongly preferred. This allows U+FEFF to be used solely with the semantics of a BOM.
UTF-8
While the Unicode standard allows a BOM in UTF-8, it neither requires nor recommends one. Byte order has no meaning in UTF-8, so a BOM serves only to identify a text stream or file as UTF-8, or to indicate that it was converted from another format that has a BOM. Many programs (including Windows Notepad) add BOMs to UTF-8 files by default.
The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF, which appears as the characters ï»¿ in most text editors and web browsers not prepared to handle UTF-8.
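As a quick check of the byte sequence above, the following Python sketch encodes U+FEFF as UTF-8 and then deliberately misreads the result as ISO-8859-1. It relies only on the standard codecs; the variable names are illustrative.

```python
import codecs

# U+FEFF encoded as UTF-8 yields the three-byte BOM.
bom_bytes = "\ufeff".encode("utf-8")
assert bom_bytes == b"\xef\xbb\xbf" == codecs.BOM_UTF8

# Software that assumes ISO-8859-1 renders those bytes as three visible characters.
mojibake = bom_bytes.decode("iso-8859-1")
print(mojibake)  # ï»¿

# The "utf-8-sig" codec drops a leading BOM on decode; plain "utf-8" keeps it.
data = bom_bytes + b"hello"
with_bom = data.decode("utf-8")         # '\ufeffhello'
without_bom = data.decode("utf-8-sig")  # 'hello'
```

Decoding with "utf-8-sig" is one common way to tolerate files that may or may not start with a BOM.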
The BOM is often recommended against because it defeats the backward compatibility with ASCII that is part of UTF-8's design. Text consisting only of ASCII characters is byte-for-byte identical in UTF-8 and in ASCII, which guarantees compatibility, unless a BOM is prepended. More importantly, many existing pieces of software can handle UTF-8 inside the text but not at the start. For instance, the bytes of UTF-8 can be placed between the quotes of string constants in many languages, and the language will write the correct UTF-8 to a file or to a display, despite knowing nothing about UTF-8. This provides an easy migration path for converting systems to Unicode and removing all legacy encodings. The unexpected three bytes of the BOM break this, however, because they are located where they are certain to be a syntax error.
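The compatibility argument can be illustrated with a minimal Python sketch. It compares the bytes of an ASCII-only string under three standard codecs; the sample text and variable names are assumptions chosen for illustration.

```python
text = "hello"  # ASCII-only sample text

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")          # no BOM is written
utf8_bom_bytes = text.encode("utf-8-sig")  # a BOM is prepended

# Without a BOM, the UTF-8 bytes are identical to the ASCII bytes.
same_without_bom = utf8_bytes == ascii_bytes

# With a BOM, the file is three bytes longer and no longer valid ASCII.
same_with_bom = utf8_bom_bytes == ascii_bytes
try:
    utf8_bom_bytes.decode("ascii")
    ascii_decodable = True
except UnicodeDecodeError:
    ascii_decodable = False
```

Here `same_without_bom` is true while `same_with_bom` and `ascii_decodable` are false, which is the loss of compatibility the paragraph describes.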
A leading BOM can also defeat software that uses pattern matching on the start of a text file, since it inserts three bytes before the pattern. Though commonly associated with the Unix shebang (#!) at the start of an interpreted script, the problem is more widespread. For instance, custom headers on a page are recognized by the first few bytes, and an inserted BOM causes the page to be sent unchanged to the browser.
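A small Python sketch of the pattern-matching problem follows. The helpers `has_shebang` and `strip_utf8_bom` are hypothetical names introduced here, and the prefix check is a simplification of what a real loader does.

```python
import codecs

def has_shebang(data: bytes) -> bool:
    """Return True if the data starts with the two bytes '#!'."""
    return data.startswith(b"#!")

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

script = b"#!/bin/sh\necho hi\n"
script_with_bom = codecs.BOM_UTF8 + script

print(has_shebang(script))                           # True
print(has_shebang(script_with_bom))                  # False: the BOM hides the pattern
print(has_shebang(strip_utf8_bom(script_with_bom)))  # True
```

Stripping the BOM before matching restores the expected behaviour, but only in software that has been written to do so.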
UTF-16
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.
For the registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order. If encountered anywhere in such a text stream, U+FEFF is to be interpreted as a zero width no-break space.
The Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer whose native byte ordering is little-endian, for example, might be argued to be implicitly encoded as UTF-16LE. The presumption of big-endian is therefore widely ignored. When those same files are accessible on the Internet, on the other hand, no such presumption can be made.
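The rule quoted above can be sketched in Python as follows. The function `decode_utf16` is a hypothetical helper, and it assumes that no higher-level protocol specifies the byte order.

```python
import codecs

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 data, using a leading BOM to choose the byte order.

    With no BOM, fall back to big-endian, as the quoted rule states
    (assumption: no higher-level protocol specifies the order).
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")

print(decode_utf16(b"\xfe\xff\x00A"))  # A (big-endian BOM)
print(decode_utf16(b"\xff\xfeA\x00"))  # A (little-endian BOM)
print(decode_utf16(b"\x00A"))          # A (no BOM, big-endian assumed)
```

Software that instead assumes the platform's native order when no BOM is present is making the local-file argument described in the paragraph.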
UTF-32
Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.
Representations of byte order marks by encoding

Encoding | Representation (hexadecimal) | Representation (decimal) | Representation (ISO-8859-1) |
---|---|---|---|
UTF-8 | EF BB BF | 239 187 191 | ï»¿ |
UTF-16 (BE) | FE FF | 254 255 | þÿ |
UTF-16 (LE) | FF FE | 255 254 | ÿþ |
UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | □□þÿ (□ is the ASCII null character) |
UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 | ÿþ□□ (□ is the ASCII null character) |
UTF-7 | 2B 2F 76, followed by one of 38, 39, 2B, 2F | 43 47 118, followed by one of 56, 57, 43, 47 | +/v, followed by one of 8, 9, +, / |
UTF-1 | F7 64 4C | 247 100 76 | ÷dL |
UTF-EBCDIC | DD 73 66 73 | 221 115 102 115 | Ýsfs |
SCSU | 0E FE FF | 14 254 255 | □þÿ (□ is the ASCII "shift out" character) |
BOCU-1 | FB EE 28 optionally followed by FF | 251 238 40 optionally followed by 255 | ûî( optionally followed by ÿ |
GB-18030 | 84 31 95 33 | 132 49 149 51 | □1■3 (□ and ■ are unmapped ISO-8859-1 characters) |
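
The byte sequences in the table can be turned into a simple BOM sniffer. The Python sketch below is one possible arrangement, not a definitive detector: `detect_bom` and `BOM_SIGNATURES` are hypothetical names, and the encoding labels are the standard names for these signatures, supplied here as an assumption. Longer signatures are tested first so that FF FE 00 00 is not mistaken for the shorter UTF-16 (LE) mark.

```python
from typing import Optional

# Signatures from the table, longest first so that prefixes do not shadow them.
BOM_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32 (BE)"),
    (b"\xff\xfe\x00\x00", "UTF-32 (LE)"),
    (b"\x84\x31\x95\x33", "GB-18030"),
    (b"\xdd\x73\x66\x73", "UTF-EBCDIC"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xf7\x64\x4c", "UTF-1"),
    (b"\x0e\xfe\xff", "SCSU"),
    (b"\xfb\xee\x28", "BOCU-1"),
    (b"\xfe\xff", "UTF-16 (BE)"),
    (b"\xff\xfe", "UTF-16 (LE)"),
]

def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for signature, name in BOM_SIGNATURES:
        if data.startswith(signature):
            return name
    # UTF-7: "+/v" followed by one of "8", "9", "+", "/".
    if data.startswith(b"+/v") and data[3:4] in (b"8", b"9", b"+", b"/"):
        return "UTF-7"
    return None

print(detect_bom(b"\xef\xbb\xbfhello"))  # UTF-8
print(detect_bom(b"\xff\xfeA\x00"))      # UTF-16 (LE)
print(detect_bom(b"\xff\xfe\x00\x00"))   # UTF-32 (LE)
print(detect_bom(b"plain text"))         # None
```

The result is only a hint: FF FE 00 00 could also be a UTF-16 (LE) BOM followed by a null character, so a caller may need other evidence to resolve that case.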