1999-08-11 - Global players need many languages, and many writing systems. For Chinese, Korean, or even just Greek, we need a way to encode such non-ASCII characters.
For a historical perspective and a beginner's technical introduction, see Joel Spolsky's missive.
The encoding standard that covers all these writing systems is Unicode, a 16 (or more) bit-wide encoding for presently 94,140 distinct coded characters derived from more than 25 supported scripts (as of Unicode 3.1). A "script" here means a writing system, such as Latin, Greek, or Han.
Tcl/Tk supports Unicode from version 8.1 onward, either as 16-bit characters or in the UTF-8 encoding; UTF-8 is the internal representation for strings.
UTF-8 is made to cover 7-bit ASCII, Unicode, and its superset ISO 10646 (which offers 31 bits of width, but seems to be overkill for most practical purposes).
Characters are represented as sequences of 1..6 eight-bit bytes, termed octets in the character set business (for ASCII: 1, for Unicode: 2..3), as follows:
- ASCII 0x00..0x7F (Unicode page 0, left half): 0x00..0x7F. Nothing changed.
- Unicode, pages 00..07: 2 bytes, 110aaabb 10bbbbbb, where aaa are the rightmost bits of the page number and bb.. are the bits of the second Unicode byte. These pages cover European/Extended Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic.
- Unicode, pages 08..FE: 3 bytes, 1110aaaa 10aaaabb 10bbbbbb. These cover all the rest of Unicode, including Hangul, Kanji, and everything else. This means that East Asian texts are 50% longer in UTF-8 than in pure 16-bit Unicode (and, for Chinese, 50% longer than in GB2312).
- ISO 10646 codes beyond Unicode: 4..6 bytes. (Never seen one yet).
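The page-based bit layouts above can be tried out in a few lines. The following is only a minimal sketch in Python, restricted to 16-bit codes; the function name `encode_16bit` is made up for this illustration, and "page" is simply the high byte of the code.

```python
def encode_16bit(code: int) -> bytes:
    """Encode a 16-bit Unicode value (0x0000..0xFFFF) as UTF-8 bytes."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("only 16-bit codes are handled in this sketch")
    page, low = code >> 8, code & 0xFF  # page = high byte, low = second byte
    if code < 0x80:
        # ASCII: one byte, nothing changed
        return bytes([code])
    if page <= 0x07:
        # 2 bytes: 110aaabb 10bbbbbb
        return bytes([0xC0 | (page << 2) | (low >> 6),
                      0x80 | (low & 0x3F)])
    # 3 bytes: 1110aaaa 10aaaabb 10bbbbbb
    return bytes([0xE0 | (page >> 4),
                  0x80 | ((page & 0x0F) << 2) | (low >> 6),
                  0x80 | (low & 0x3F)])


for ch in ("A", "α", "中"):
    encoded = encode_16bit(ord(ch))
    # Compare with Python's own encoder and with the 2-byte UTF-16 form
    print(ch, encoded.hex(), encoded == ch.encode("utf-8"),
          len(encoded), "vs", len(ch.encode("utf-16-be")), "bytes")
```

For a Chinese character the loop reports 3 bytes against 2, which is where the 50% figure comes from.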
A general principle of UTF-8
A general principle of UTF-8 is that the first byte either is a single-byte character (if below 0x80), or indicates the length of a multi-byte code by the number of 1 bits before the first 0, and is then filled up with data bits.
All other bytes start with the bits 10 and are then filled up with 6 data bits. A sequence of n bytes can hold
b = 5n + 1 (1 < n < 7)
bits of "payload", so the maximum is 31 bits for a 6-byte sequence.
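To check the arithmetic: an initial byte of an n-byte sequence spends n bits on 1s and one bit on the terminating 0, leaving 7 - n data bits, and each of the other n - 1 bytes carries 6. A small Python sketch, with the hypothetical helper names `sequence_length` and `payload_bits`, counts this out and compares it with b = 5n + 1.

```python
def sequence_length(first_byte: int) -> int:
    """Length announced by an initial byte (1 for a single-byte character)."""
    if first_byte < 0x80:
        return 1
    ones, mask = 0, 0x80
    while first_byte & mask:      # count the 1 bits before the first 0
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 6:     # 10xxxxxx is non-initial; FE/FF are never used
        raise ValueError(f"0x{first_byte:02X} is not an initial byte")
    return ones


def payload_bits(n: int) -> int:
    """Data bits available in an n-byte sequence."""
    if n == 1:
        return 7
    return (7 - n) + 6 * (n - 1)  # initial byte plus n-1 following bytes


for n in range(2, 7):
    print(n, payload_bits(n), 5 * n + 1)
```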
It follows from this that bytes in UTF-8 encoding fall in distinct ranges:
00..7F - plain old ASCII
80..BF - non-initial bytes of multibyte code
C2..FD - initial bytes of multibyte code (C0, C1 are not legal!)
FE, FF - never used (so, free for byte-order marks).
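The ranges translate directly into a lookup. This is a rough Python sketch; the function name `classify_byte` and the labels it returns are invented for the example. C0 and C1 are treated as illegal because, as initial bytes of a 2-byte code, they could only express values that already fit in a single byte.

```python
def classify_byte(b: int) -> str:
    """Name the range a single byte value falls into."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ascii"          # 00..7F
    if b <= 0xBF:
        return "continuation"   # 80..BF, non-initial byte
    if b in (0xC0, 0xC1):
        return "illegal"        # would only encode values below 0x80
    if b <= 0xFD:
        return "initial"        # C2..FD
    return "unused"             # FE, FF


for byte in "a中".encode("utf-8"):
    print(f"{byte:02X}", classify_byte(byte))
```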
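The distinction between initial and non-initial bytes helps in plausibility checks, or to re-synchronize after missing data. If a single byte is lost from a string, for example, GB2312-encoded text can be garbled from that point on, while UTF-8 text loses only the damaged character.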
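Re-synchronizing can be as simple as skipping bytes of the form 10xxxxxx until something else turns up. Here is a minimal Python sketch, assuming the damage is a lost initial byte at the start of the data; `skip_to_next_character` is a name made up for this example.

```python
def skip_to_next_character(data: bytes, pos: int) -> int:
    """Return the first index >= pos that does not hold a non-initial byte."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


text = "中文 UTF-8".encode("utf-8")
damaged = text[1:]                      # the initial byte of the first character is lost
start = skip_to_next_character(damaged, 0)
recovered = damaged[start:].decode("utf-8")
print(start, recovered)
```

Only the character whose initial byte went missing is dropped; everything after it decodes normally.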
Besides, UTF-8 is independent of byte order. 16-bit Unicode inherits the byte order of the machine, so it has to express that with the magic FEFF; should you read FFFE, you're to swap.
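The FEFF check is easy to sketch with the byte-order-mark constants in Python's standard `codecs` module. The function name `detect_utf16_byte_order` is hypothetical, and the sketch assumes the data starts with a mark.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of 16-bit Unicode data from its first two bytes."""
    if data[:2] == codecs.BOM_UTF16_BE:   # bytes FE FF: read as written
        return "big-endian"
    if data[:2] == codecs.BOM_UTF16_LE:   # bytes FF FE: swap each byte pair
        return "little-endian"
    return "unknown"


sample = "中"
big = codecs.BOM_UTF16_BE + sample.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
print(big.hex(), detect_utf16_byte_order(big))
print(little.hex(), detect_utf16_byte_order(little))
print(sample.encode("utf-8").hex())       # the UTF-8 form has only one byte order
```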
- If a UCS code fits 7 bits, it is coded as 0xxxxxxx. This makes ASCII characters represent themselves.
- If a UCS code fits 11 bits, it is coded as 110xxxxx 10xxxxxx
- If a UCS code fits 16 bits, it is coded as 1110xxxx 10xxxxxx 10xxxxxx
- If a UCS code fits 21 bits, it is coded as 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- If a UCS code fits 26 bits, it is coded as 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
- If a UCS code fits 31 bits, it is coded as 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
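The six rules above collapse into one loop: find the smallest n whose payload of 5n + 1 bits holds the code, peel off 6 bits at a time for the trailing bytes, and put what is left into the initial byte. This is a sketch in Python of the 1..6 byte scheme as described here; the name `utf8_encode` is invented for the example. Note that present-day UTF-8 (RFC 3629) stops at 4 bytes, so Python's own encoder can only serve as a cross-check up to U+10FFFF.

```python
def utf8_encode(code: int) -> bytes:
    """Encode a UCS code of up to 31 bits using the 1..6 byte scheme."""
    if code < 0:
        raise ValueError("negative code")
    if code < 0x80:
        return bytes([code])                   # 0xxxxxxx
    for n in range(2, 7):
        if code < (1 << (5 * n + 1)):          # fits the payload of n bytes
            break
    else:
        raise ValueError("code does not fit 31 bits")
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (code & 0x3F))      # 10xxxxxx, lowest 6 bits first
        code >>= 6
    lead = ((0xFF << (8 - n)) & 0xFF) | code   # n ones, a zero, remaining bits
    return bytes([lead] + tail[::-1])


for code in (0x41, 0x3B1, 0x4E2D, 0x1F600, 0x7FFFFFFF):
    print(f"{code:X}", utf8_encode(code).hex())
```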