Byte_order_mark - alanland - ChinaUnix Blog

Byte_order_mark

Category: WINDOWS

2011-03-10 12:27:53

from:
Usage

If the BOM character appears in the middle of a data stream, it should, according to Unicode, be interpreted as a "zero-width non-breaking space" (essentially a null character). Its deliberate use for this purpose has been deprecated since Unicode 3.2, however, with the "Word Joiner" character, U+2060, strongly preferred. This allows U+FEFF to be used solely with the semantics of a BOM.
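The two code points named above can be inspected with Python's standard `unicodedata` module. The helper `replace_interior_bom` is a hypothetical name introduced here for illustration; it is only a minimal sketch of the preferred usage, keeping a leading U+FEFF and swapping any later occurrence for U+2060.

```python
import unicodedata

BOM = "\ufeff"          # U+FEFF, the byte order mark character
WORD_JOINER = "\u2060"  # U+2060, the preferred replacement inside text


def replace_interior_bom(text: str) -> str:
    """Keep a leading U+FEFF; replace any later U+FEFF with U+2060."""
    if not text:
        return text
    head, tail = text[0], text[1:]
    return head + tail.replace(BOM, WORD_JOINER)


# Official Unicode names of the two characters.
bom_name = unicodedata.name(BOM)
joiner_name = unicodedata.name(WORD_JOINER)
print(bom_name)
print(joiner_name)
```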

UTF-8

While the Unicode standard allows a BOM in UTF-8, it does not require or recommend it. Byte order has no meaning in UTF-8, so a BOM only serves to identify a text stream or file as UTF-8, or to show that it was converted from another format that has a BOM. Many programs (including Windows Notepad) add BOMs to UTF-8 files by default.

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF, which appears as the ISO-8859-1 characters ï»¿ in most text editors and web browsers not prepared to handle UTF-8.
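The byte sequence can be checked directly in Python. The sketch below uses only the standard `codecs` module and the built-in `utf-8` and `utf-8-sig` codecs; the variable names are illustrative.

```python
import codecs

# U+FEFF encoded as UTF-8 yields the three signature bytes.
bom_bytes = "\ufeff".encode("utf-8")
assert bom_bytes == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# Misread as ISO-8859-1, each byte becomes one visible character.
mojibake = bom_bytes.decode("iso-8859-1")

data = codecs.BOM_UTF8 + "hello".encode("utf-8")
plain = data.decode("utf-8")         # the BOM survives as a leading U+FEFF
stripped = data.decode("utf-8-sig")  # the BOM is consumed by the codec
```

Decoding with `utf-8-sig` is one way to accept input both with and without the signature.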

The reason the BOM is often recommended against is that it defeats the ASCII backward compatibility that is part of UTF-8's design. Text consisting only of ASCII characters is byte-for-byte identical in ASCII and UTF-8, which guarantees compatibility, but a BOM breaks that identity. More importantly, many existing pieces of software can handle UTF-8 inside the text but not at the start. For instance, UTF-8 bytes can be placed between the quotes of string constants in many languages, and the program will write the correct UTF-8 to a file or to a display even though the language knows nothing about UTF-8. This provides an easy migration path for converting systems to Unicode and removing all legacy encodings. The three unexpected bytes of the BOM break this, however, because they sit at the start of the file, where they are certain to be a syntax error.
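The compatibility argument can be illustrated with a few lines of Python. This is a minimal sketch; `is_pure_ascii` is a hypothetical helper name, and the sample text is arbitrary.

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte can be decoded as 7-bit ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


text = "print('hi')"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")          # no BOM
utf8_bom_bytes = text.encode("utf-8-sig")  # BOM prepended by the codec

# Without a BOM the two encodings are identical; with one they are not.
same_without_bom = ascii_bytes == utf8_bytes
still_ascii_with_bom = is_pure_ascii(utf8_bom_bytes)
```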

A leading BOM can also defeat software that uses pattern matching on the start of a text file, since it inserts 3 bytes before the pattern. Though commonly associated with the Unix shebang at the start of an interpreted script, the problem is more widespread. For instance, in PHP, custom headers on a page are recognized by the first few bytes, and an inserted BOM causes the page to be sent unchanged to the browser.
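A short sketch of the pattern-matching problem, using a shebang check as the example. The function names `has_shebang` and `strip_utf8_bom` are hypothetical, and the check is deliberately naive: it looks only at the first two bytes, as the kind of software described above would.

```python
import codecs


def has_shebang(data: bytes) -> bool:
    """Naive check: the file must start with the two bytes '#!'."""
    return data.startswith(b"#!")


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM, if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


script = b"#!/bin/sh\necho hello\n"
with_bom = codecs.BOM_UTF8 + script  # same script saved with a BOM
```

Stripping the signature before matching restores the expected behaviour, at the cost of every consumer having to know about it.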

UTF-16

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF (where "0x" indicates hexadecimal);
if the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE.
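Both byte sequences can be produced by encoding U+FEFF with Python's built-in `utf-16-be` and `utf-16-le` codecs. This is a small sketch with illustrative variable names; the last line shows what a reader sees when it assumes the wrong order.

```python
import codecs

be = "\ufeff".encode("utf-16-be")  # most significant byte first
le = "\ufeff".encode("utf-16-le")  # least significant byte first

assert be == codecs.BOM_UTF16_BE == b"\xfe\xff"
assert le == codecs.BOM_UTF16_LE == b"\xff\xfe"

# Reading the big-endian BOM with the wrong byte order yields U+FFFE,
# which is not U+FEFF, so the mismatch is detectable.
swapped = be.decode("utf-16-le")
```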

For the registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order. If encountered anywhere in such a text stream, U+FEFF is to be interpreted as a zero width no-break space.

The Unicode standard states: "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore the presumption of big-endian is widely ignored. When those same files are accessible on the Internet, on the other hand, no such presumption can be made.
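The rule quoted above can be sketched as a small decoder. `decode_utf16` is a hypothetical helper, not a standard library function; it handles the BOM explicitly and falls back to big-endian when none is present, assuming no higher-level protocol says otherwise.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, choosing the byte order from a leading BOM.

    Without a BOM, big-endian is assumed, as the quoted rule prescribes.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")


from_be = decode_utf16(b"\xfe\xff\x00A")  # BOM says big-endian
from_le = decode_utf16(b"\xff\xfeA\x00")  # BOM says little-endian
no_bom = decode_utf16(b"\x00A")           # no BOM: big-endian assumed
```

Software that instead presumes the machine's native order for local files would replace the final fallback with its own choice.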

UTF-32

Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.

Representations of byte order marks by encoding

Encoding	Representation (hexadecimal)	Representation (decimal)	Representation (ISO-8859-1)
UTF-8	EF BB BF	239 187 191	ï»¿
UTF-16 (BE)	FE FF	254 255	þÿ
UTF-16 (LE)	FF FE	255 254	ÿþ
UTF-32 (BE)	00 00 FE FF	0 0 254 255	□□þÿ (□ is the ASCII null character)
UTF-32 (LE)	FF FE 00 00	255 254 0 0	ÿþ□□ (□ is the ASCII null character)
UTF-7	2B 2F 76, and one of the following: [ 38 | 39 | 2B | 2F ]	43 47 118, and one of the following: [ 56 | 57 | 43 | 47 ]	+/v, and one of the following: 8 9 + /
UTF-1	F7 64 4C	247 100 76	÷dL
UTF-EBCDIC	DD 73 66 73	221 115 102 115	Ýsfs
SCSU	0E FE FF	14 254 255	□þÿ (□ is the ASCII "shift out" character)
BOCU-1	FB EE 28 optionally followed by FF	251 238 40 optionally followed by 255	ûî( optionally followed by ÿ
GB-18030	84 31 95 33	132 49 149 51	□1■3 (□ and ■ are unmapped ISO-8859-1 characters)

While identifying text as UTF-8, this is not really a "byte order" mark. Since the byte is also the word in UTF-8, there is no byte order to resolve.
In UTF-7, the fourth byte of the BOM, before encoding as base64, is 001111xx in binary, and xx depends on the next character (the first character after the BOM). Hence, technically, the fourth byte is not purely a part of the BOM, but also contains information about the next (non-BOM) character. For xx=00, 01, 10, 11, this byte is, respectively, 38, 39, 2B, or 2F when encoded as base64. If no following character is encoded, 38 is used for the fourth byte and the following byte is 2D.
SCSU allows other encodings of U+FEFF; the form shown is the signature recommended in UTR #6.
For BOCU-1 a signature changes the state of the decoder. Octet 0xFF resets the decoder to the initial state.
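The signatures in the table can be turned into a simple detector. This is a sketch under stated assumptions: `sniff_bom` and the labels it returns are hypothetical names chosen for illustration, and the byte values are copied from the table above. The UTF-32 signatures are tested before the UTF-16 ones because FF FE 00 00 also begins with FF FE.

```python
from typing import Optional

# (signature bytes, label), longest and most specific signatures first.
_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32 (BE)"),
    (b"\xff\xfe\x00\x00", "UTF-32 (LE)"),
    (b"\xdd\x73\x66\x73", "UTF-EBCDIC"),
    (b"\x84\x31\x95\x33", "GB-18030"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xf7\x64\x4c", "UTF-1"),
    (b"\x0e\xfe\xff", "SCSU"),
    (b"\xfb\xee\x28", "BOCU-1"),
    (b"\xfe\xff", "UTF-16 (BE)"),
    (b"\xff\xfe", "UTF-16 (LE)"),
]

_UTF7_PREFIX = b"+/v"    # 2B 2F 76
_UTF7_FOURTH = b"89+/"   # 38, 39, 2B or 2F


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a label for the signature at the start of data, or None."""
    for signature, label in _SIGNATURES:
        if data.startswith(signature):
            return label
    # UTF-7: three fixed bytes plus a fourth that depends on the next character.
    if (
        data.startswith(_UTF7_PREFIX)
        and len(data) >= 4
        and data[3] in _UTF7_FOURTH
    ):
        return "UTF-7"
    return None
```

One ambiguity remains by construction: UTF-16 (LE) text whose first character after the BOM is U+0000 is indistinguishable from the UTF-32 (LE) signature, and this sketch reports it as UTF-32 (LE).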