VC Knowledge Base BLOG - 遇君阁 - March 21, 2008 Entries - RdsiVup - ChinaUnix Blog

2008-10-13 16:30:08

I think every software developer has heard of Unicode, but in my experience some of them misunderstand it. Why do I
say that? Because you might think Unicode is a character set, while someone else thinks it is an encoding
methodology, and so on. You are right, but only partly. Wikipedia describes the Unicode standard like this:

Unicode consists of a repertoire of about 100,000 characters, a set of code charts for visual reference, an encoding
methodology and set of standard character encodings, an enumeration of character properties such as upper and lower
case, a set of reference data computer files, and a number of related items, such as character properties, rules for
text normalization, decomposition, collation, rendering and bidirectional display order (for the correct display of
text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).

OK, after reading this you might say it is very complex, but for everyday work it is enough to know two parts of the
Unicode standard:

1. Unicode is a character set: it assigns a number (a code point) to each character.
2. Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal Character
Set (UCS) encodings.
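
To make the difference between the character set and its encodings concrete, here is a minimal sketch. It takes one code point and reports how many code units UTF-8, UTF-16 and UTF-32 each need for it. The names CodeUnitCounts and code_unit_counts are made up for this illustration, and the function assumes it is given a valid code point (at most U+10FFFF, not a surrogate).

```cpp
#include <cstdint>

// Number of code units needed to store one code point in each encoding form.
// A code unit is 8 bits in UTF-8, 16 bits in UTF-16 and 32 bits in UTF-32.
struct CodeUnitCounts {
    int utf8;
    int utf16;
    int utf32;
};

// Hypothetical helper for illustration only.
// Assumes cp is a valid Unicode scalar value (<= 0x10FFFF, not a surrogate).
constexpr CodeUnitCounts code_unit_counts(std::uint32_t cp) {
    return CodeUnitCounts{
        cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4,  // UTF-8: 1 to 4 octets
        cp <= 0xFFFF ? 1 : 2,  // UTF-16: one unit, or a surrogate pair above U+FFFF
        1                      // UTF-32: always one unit
    };
}
```

The same character (one code point in the character set) ends up with a different length depending on which encoding you pick, which is why "Unicode" alone does not tell you how a string is stored.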

In my experience, some people also confuse UTF with Unicode. UTF is an encoding method for Unicode, and so far
there are three main forms: UTF-8, UTF-16 and UTF-32. The numbers 8, 16 and 32 give the size in bits of the code
unit that each form uses. For example, UTF-8 is an octet-based (8-bit) lossless encoding of Unicode characters: it
encodes each character as a variable number of 1 to 4 octets, where the number of octets depends on the integer
value assigned to the character. It is an efficient encoding for Unicode documents that use mostly US-ASCII
characters, because it represents each character in the range U+0000 through U+007F as a single octet.
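
The 1 to 4 octet rule can be sketched in a few lines of C++. This is only an illustration of the UTF-8 bit layout, not a production encoder; the function name encode_utf8 is my own, and returning an empty string for invalid input is a choice made for this sketch.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 octets).
// Returns an empty string for values that are not valid scalar values
// (surrogates U+D800..U+DFFF, or anything above U+10FFFF).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        // 0xxxxxxx: US-ASCII stays a single octet
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, encode_utf8(0x41) gives the single octet 41 ("A"), while encode_utf8(0x20AC) (the euro sign) gives the three octets E2 82 AC.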

Finally, C and C++ support Unicode through wchar_t, but there are a few things to note. So what is the size of a
wchar_t? 2 bytes or 4? The answer is yes: it can be either, because the C and C++ standards do not specify an exact
width. The Unicode 4.0
standard says that "ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but
requires that the characters from the portable C execution set correspond to their wide character equivalents by
zero extension." Furthermore, the standard specifies: "The width of wchar_t is compiler-specific and can be as small
as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for
storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be
Unicode characters in some compilers."

In practice, a UNIX-like operating system will usually use 4 bytes per wchar_t (it is best to verify this with
sizeof(wchar_t)). If you use the Microsoft Windows API, you end up with 2 bytes per wchar_t.
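
Here is a small sketch of how you might check this on your own compiler instead of guessing. The names wchar_bits, wchar_holds_any_code_point and wchar_units_needed are invented for this example. The last function assumes that a narrow wchar_t holds UTF-16, as the Windows API does; that is an assumption about the platform, not something the C++ standard guarantees.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t
#include <cstdint>   // std::uint32_t
#include <cwchar>    // WCHAR_MAX

// Width of wchar_t on this compiler, in bits.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// True when a single wchar_t can hold every code point up to U+10FFFF.
constexpr bool wchar_holds_any_code_point = WCHAR_MAX >= 0x10FFFF;

// How many wchar_t units one code point needs.
// Assumption: when wchar_t is too narrow, the platform stores UTF-16 in it,
// so code points above U+FFFF take a surrogate pair (2 units).
std::size_t wchar_units_needed(std::uint32_t cp) {
    return (wchar_holds_any_code_point || cp <= 0xFFFF) ? 1 : 2;
}
```

Code that must be portable can use these values to pick a code path, or avoid wchar_t altogether and use a fixed-width type such as char16_t or char32_t.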

Note: whenever you are handed a string, you must know which encoding it uses.
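
A short sketch of why this matters: the two bytes C3 A9 are one character (é) if the string is UTF-8, but two characters if it is Latin-1, so even counting characters depends on the encoding. The function names below are hypothetical, and the UTF-8 counter assumes its input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count characters assuming the bytes are UTF-8: every byte that is not a
// continuation byte (10xxxxxx) starts a new code point.
// Assumes the input is well-formed UTF-8; it does not validate.
std::size_t count_code_points_utf8(const std::string& bytes) {
    std::size_t count = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Count characters assuming a single-byte encoding such as Latin-1:
// every byte is one character.
std::size_t count_chars_single_byte(const std::string& bytes) {
    return bytes.size();
}
```

Calling both functions on the same bytes "\xC3\xA9" gives 1 and 2 respectively. The bytes themselves carry no label, so the encoding has to come from somewhere else, for example a protocol header, a file format, or a convention in your API.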
