2008-10-13 16:30:08

I think every software developer has heard of Unicode, but in my experience some of them misunderstand it. Why do I
say this? Because some people think Unicode is a character set, while others think it is an encoding methodology.
Both views are right, but only partly. Wikipedia defines the Unicode standard as follows:

Unicode consists of a repertoire of about 100,000 characters, a set of code charts for visual reference, an encoding
methodology and set of standard character encodings, an enumeration of character properties such as upper and lower
case, a set of reference data computer files, and a number of related items, such as character properties, rules for
text normalization, decomposition, collation, rendering and bidirectional display order (for the correct display of
text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).

After reading this, you may say it is very complex. In practice, though, knowing two parts of the Unicode standard
is enough:

1. Unicode is a character set.
2. Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal Character
Set (UCS) encodings.

In my experience, some people also confuse UTF with Unicode. UTF is an encoding method for Unicode, and the standard
currently defines three kinds: UTF-8, UTF-16, and UTF-32. The numbers 8, 16, and 32 give the size in bits of the code
unit that each format uses. For example, UTF-8 is an octet-based (8-bit) lossless encoding of Unicode characters: it
encodes each character as a variable number of 1 to 4 octets, where the number of octets depends on the integer value
assigned to the Unicode character. It is an efficient encoding of Unicode documents that use mostly US-ASCII
characters because it represents each character in the range U+0000 through U+007F as a single octet.
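
To make the "1 to 4 octets" rule concrete, here is a minimal sketch in C of how a single code point could be turned into UTF-8. The function name utf8_encode and its signature are my own assumptions for illustration, not part of any standard library; the ranges follow the UTF-8 definition (1 octet up to U+007F, 2 up to U+07FF, 3 up to U+FFFF, 4 up to U+10FFFF).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): encodes one Unicode code
 * point as UTF-8 into out[0..3]. Returns the number of octets written, or 0
 * if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp <= 0x7F) {                      /* US-ASCII range: one octet */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                     /* two octets: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                    /* three octets */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four octets */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* outside the Unicode range */
}
```

Calling utf8_encode(0x41, buf) writes the single octet 0x41, while utf8_encode(0x20AC, buf) (the euro sign) writes the three octets E2 82 AC.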

Finally, C and C++ offer wchar_t for handling Unicode text, but there are some caveats. So what is the size of a
wchar_t: 2 or 4 bytes? The answer is "yes"; that is, the standards don't specify an exact length. The Unicode 4.0
standard says that "ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but
requires that the characters from the portable C execution set correspond to their wide character equivalents by
zero extension." Furthermore, the standard specifies: "The width of wchar_t is compiler-specific and can be as small
as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for
storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be
Unicode characters in some compilers."

In practice, a UNIX-like operating system will usually use 4 bytes (it's best to verify this by using sizeof()).
If you use the Microsoft Windows API, you end up with 2 bytes per wchar_t.
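
The sizeof() check mentioned above can be sketched as follows. The two function names are hypothetical helpers of mine; CHAR_BIT and WCHAR_MAX are the standard macros from <limits.h> and <wchar.h>. The result depends on the compiler and platform, so no particular output is assumed here.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: width of wchar_t in bits for the current compiler. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Hypothetical helper: returns 1 if a single wchar_t can hold every code
 * point up to U+10FFFF, and 0 otherwise (for example with a 16-bit wchar_t,
 * where characters above U+FFFF need two code units). */
int wchar_holds_any_code_point(void)
{
    return (unsigned long)WCHAR_MAX >= 0x10FFFFUL;
}
```

Printing wchar_bits() from a small main() on each target platform is a quick way to see which case applies before relying on wchar_t for Unicode text.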

Note: you must know the encoding method whenever you handle a string.
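
As a small illustration of why the encoding matters, the sketch below counts characters in a byte string under the assumption that the bytes are well-formed UTF-8. The function name is hypothetical. Without knowing the encoding, all a program can count is bytes, and for non-ASCII text the two numbers differ.

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in a NUL-terminated byte string,
 * assuming the bytes are well-formed UTF-8. Continuation octets have the
 * bit pattern 10xxxxxx, so every other octet starts a new character. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the two-byte string "\xC3\xA9" (the letter e with an acute accent in UTF-8) this function returns 1, although the string occupies 2 bytes.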


 

