
Category: LINUX

2010-12-21 10:57:05

1999-08-11 - Global players need many languages and writing systems. For Chinese, Korean, or even just Greek, we need a way to encode such non-ASCII characters.

For a historical perspective and a beginner's technical introduction, see Joel Spolsky's missive on the subject.

The encoding standard that covers all these writing systems is Unicode, a 16 (or more) bit-wide encoding for presently 94,140 distinct coded characters derived from more than 25 supported scripts, that is, writing systems (as of Unicode 3.1).

Tcl/Tk has supported Unicode since version 8.1, with strings held either as 16-bit characters or in the UTF-8 encoding as the internal representation. (Translator's note: internally it uses UTF-8.)

UTF-8 is made to cover 7-bit ASCII, Unicode, and its superset ISO 10646 (which offers 31 bits of width, but seems to be overkill for most practical purposes).

Characters are represented as sequences of 1..6 eight-bit bytes - termed octets in the character set business - (for ASCII: 1, for Unicode: 2..3) as follows:
  • ASCII 0x00..0x7F (Unicode page 0, left half): 0x00..0x7F. Nothing changed.
  • Unicode, pages 00..07: 2 bytes, 110aaabb 10bbbbbb, where aaa are the rightmost bits of the page number and bb.. are the bits of the second Unicode byte. These pages cover European/Extended Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic.
  • Unicode, pages 08..FF: 3 bytes, 1110aaaa 10aaaabb 10bbbbbb. These cover all the rest of Unicode, including Hangul, Kanji, and whatever else. This means that East Asian texts are 50% longer in UTF-8 than in pure 16-bit Unicode. (Translator's note: and likewise 50% longer than in GB2312.)
  • ISO 10646 codes beyond Unicode: 4..6 bytes. (Never seen one yet.)
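The 1-, 2- and 3-byte layouts listed above can be tried out directly. What follows is a minimal Python sketch, not code from the original article: the function name `encode_bmp` is made up for illustration, and it covers only code points up to 0xFFFF. It compares the hand-built bytes with Python's built-in UTF-8 codec and then checks the 50% size claim on a short Chinese string.

```python
def encode_bmp(cp: int) -> bytes:
    """Encode a code point in 0x0000..0xFFFF using the layouts listed above.

    Hypothetical helper for illustration; surrogate code points
    (0xD800..0xDFFF) are not rejected in this sketch.
    """
    if cp < 0x80:                      # ASCII: unchanged
        return bytes([cp])
    if cp < 0x800:                     # pages 00..07: 110aaabb 10bbbbbb
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                   # pages 08 and up: 1110aaaa 10aaaabb 10bbbbbb
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("outside the 16-bit range covered here")

# Compare with the built-in codec on a Latin, Greek, Hebrew and Han character.
for ch in ("A", "α", "א", "中"):
    assert encode_bmp(ord(ch)) == ch.encode("utf-8")

text = "中文編碼"
utf8_len = len(text.encode("utf-8"))       # 3 bytes per character
utf16_len = len(text.encode("utf-16-be"))  # 2 bytes per character, no byte-order mark
print(utf8_len, utf16_len)                 # 12 8
```

Twelve bytes against eight is the 50% overhead mentioned for East Asian text.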

A general principle of UTF-8

A general principle of UTF-8 is that the first byte either is a single-byte character (if below 0x80), or indicates the length of a multi-byte code by the number of 1's before the first 0, and is then filled up with data bits.
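Counting the leading 1 bits of the first byte can be sketched in a few lines of Python. The name `sequence_length` is hypothetical, and the sketch only reads the length marker; it does not check for C0 or C1, which announce two bytes but are not legal lead bytes.

```python
def sequence_length(first_byte: int) -> int:
    """Return the total number of bytes announced by a UTF-8 lead byte."""
    if first_byte < 0x80:
        return 1                       # 0xxxxxxx: a single-byte character
    ones = 0
    mask = 0x80
    while mask and first_byte & mask:  # count the 1 bits before the first 0
        ones += 1
        mask >>= 1
    if ones == 1:
        raise ValueError("10xxxxxx is a non-initial byte, not a lead byte")
    if ones > 6:
        raise ValueError("FE and FF never start a sequence")
    return ones

print([sequence_length(b) for b in (0x41, 0xCE, 0xE4, 0xF0, 0xFC)])  # [1, 2, 3, 4, 6]
```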

All other bytes start with bits 10 and are then filled up with 6 data bits. A sequence of n bytes can hold
 b = 5n + 1  (1 < n < 7)
bits of "payload", so the maximum is 31 bits for a 6-byte sequence.
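The payload formula follows from the bit layout: the first byte spends n one-bits and a terminating zero on the length marker, leaving 7 - n data bits, and each of the other n - 1 bytes carries 6. A small Python check (the function name `payload_bits` is made up for this sketch) confirms that this adds up to 5n + 1:

```python
def payload_bits(n: int) -> int:
    """Data bits available in an n-byte UTF-8 sequence, for n in 2..6."""
    lead = 8 - (n + 1)      # the marker takes n one-bits plus one zero bit
    tail = 6 * (n - 1)      # every non-initial byte is 10xxxxxx
    return lead + tail

for n in range(2, 7):
    assert payload_bits(n) == 5 * n + 1

print([payload_bits(n) for n in range(2, 7)])   # [11, 16, 21, 26, 31]
```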

It follows from this that bytes in UTF-8 encoding fall into distinct ranges:
   00..7F - plain old ASCII
   80..BF - non-initial bytes of multibyte code
   C2..FD - initial bytes of multibyte code (C0, C1 are not legal!)
   FE, FF - never used (so, free for byte-order marks).
The distinction between initial and non-initial bytes helps in plausibility checks, or to re-synchronize after missing data. (Translator's note: if a byte is lost from a string, GB2312-encoded text becomes completely garbled, whereas UTF-8 text does not.)
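To make the re-synchronization idea concrete, here is a small Python sketch built on the byte ranges given above. The names `classify` and `resync` are hypothetical, and the ranges are the ones from this article's table (later revisions of UTF-8 narrow the set of legal lead bytes further).

```python
def classify(byte: int) -> str:
    """Sort a byte into the ranges from the table above."""
    if byte <= 0x7F:
        return "ascii"
    if byte <= 0xBF:
        return "non-initial"
    if 0xC2 <= byte <= 0xFD:
        return "initial"
    return "invalid"                   # C0, C1, FE, FF

def resync(data: bytes, pos: int) -> int:
    """Return the first index >= pos at which a character could start."""
    while pos < len(data) and classify(data[pos]) == "non-initial":
        pos += 1
    return pos

good = "aαβ".encode("utf-8")           # 61 CE B1 CE B2
damaged = good[:1] + good[2:]          # the lead byte of α went missing
start = resync(damaged, 1)             # skip the orphaned non-initial byte
print(start, damaged[start:].decode("utf-8"))   # 2 β
```

Only the damaged character is lost; everything after it decodes normally.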

Besides, UTF-8 is independent of byte order. 16-bit Unicode inherits the byte order of the machine, so it has to signal that order with the magic value FEFF; should you read FFFE instead, you have to swap the bytes.
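The byte-order point can be seen with Python's standard codecs. This is only an illustration with one sample character; the `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` constants are the FE FF and FF FE marks as bytes.

```python
import codecs

text = "中"                                # U+4E2D
big = text.encode("utf-16-be")             # 4E 2D
little = text.encode("utf-16-le")          # 2D 4E
assert big == little[::-1]                 # same character, swapped bytes

# With the FEFF mark in front, a reader can tell which order was used.
assert (codecs.BOM_UTF16_BE + big).decode("utf-16") == text
assert (codecs.BOM_UTF16_LE + little).decode("utf-16") == text

# UTF-8 is a plain byte sequence, so there is nothing to swap.
assert text.encode("utf-8") == b"\xe4\xb8\xad"
```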

  • If a UCS code fits in 7 bits, it is coded as 0xxxxxxx. This makes ASCII characters represent themselves.
  • If a UCS code fits in 11 bits, it is coded as 110xxxxx 10xxxxxx
  • If a UCS code fits in 16 bits, it is coded as 1110xxxx 10xxxxxx 10xxxxxx
  • If a UCS code fits in 21 bits, it is coded as 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  • If a UCS code fits in 26 bits, it is coded as 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
  • If a UCS code fits in 31 bits, it is coded as 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
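The six cases above reduce to one loop. The Python sketch below uses a made-up function name, `encode_ucs`, and follows the 1..6-byte scheme exactly as tabulated here. Python's own codec stops at code point 0x10FFFF (4 bytes), so the comparison with the built-in encoder covers only that range; the 5- and 6-byte results are checked against the table by hand.

```python
def encode_ucs(code: int) -> bytes:
    """Encode a value of up to 31 bits with the 1..6-byte scheme above."""
    if code < 0:
        raise ValueError("negative code")
    if code < 0x80:
        return bytes([code])               # 0xxxxxxx
    for n in range(2, 7):                  # smallest n with 5n + 1 payload bits
        if code < (1 << (5 * n + 1)):
            break
    else:
        raise ValueError("needs more than 31 bits")
    tail = []
    for _ in range(n - 1):                 # peel off 6 bits per non-initial byte
        tail.append(0x80 | (code & 0x3F))
        code >>= 6
    marker = (0xFF << (8 - n)) & 0xFF      # n one-bits followed by zeros
    return bytes([marker | code] + tail[::-1])

# Agrees with the built-in codec wherever both are defined.
for cp in (0x41, 0x3B1, 0x4E2D, 0x1F600):
    assert encode_ucs(cp) == chr(cp).encode("utf-8")

print(encode_ucs(0x7FFFFFFF).hex())        # fdbfbfbfbfbf
```

The largest 31-bit value comes out as FD followed by five BF bytes, which is why FD is the highest lead byte in the range table.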
