2008-10-13 16:30:08
I think every software developer has heard of Unicode, but as far as I can tell, some of them misunderstand it. Why do
I say this? Because you may think Unicode is a character set, while someone else thinks it is an encoding methodology,
and so on. You are right, but only partly. Wikipedia defines the Unicode standard as follows:
Unicode consists of a repertoire of about 100,000 characters, a set of code charts for visual reference, an encoding
methodology and set of standard character encodings, an enumeration of character properties such as upper and lower
case, a set of reference data computer files, and a number of related items, such as character properties, rules for
text normalization, decomposition, collation, rendering and bidirectional display order (for the correct display of
text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).
OK. After reading this you will probably say it is very complex, but for everyday work it is enough to know two parts
of the Unicode standard:
1. Unicode is a character set: it assigns each character an integer value, its code point.
2. Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal Character
Set (UCS) encodings.
In my experience, some people also confuse UTF with Unicode. UTF is an encoding method for Unicode, and three kinds
are in common use today: UTF-8, UTF-16 and UTF-32. The numbers 8, 16 and 32 give the size in bits of the code unit
that each format uses. For example, UTF-8 is an octet (8-bit) lossless encoding of Unicode characters: it encodes each
Unicode character as a variable number of 1 to 4 octets, where the number of octets depends on the integer value
assigned to the Unicode character. It is an efficient encoding of Unicode documents that use mostly US-ASCII
characters because it represents each character in the range U+0000 through U+007F as a single octet.
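To make the "1 to 4 octets" rule concrete, here is a minimal C++11 sketch of a UTF-8 encoder for a single code point. The function name encode_utf8 is my own, made up for illustration, and the choice to return an empty string for invalid input is an assumption of this sketch, not part of any standard API.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 octets).
// Hypothetical helper for illustration only. Returns an empty string for
// values that cannot be encoded (surrogates, or anything above U+10FFFF).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                         // U+0000..U+007F: 1 octet
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {                 // U+0080..U+07FF: 2 octets
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {                // U+0800..U+FFFF: 3 octets
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;                       // surrogates are not characters
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {              // U+10000..U+10FFFF: 4 octets
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    assert(out.size() <= 4);                  // UTF-8 never needs more than 4 octets
    return out;
}
```

With this sketch, encode_utf8(0x41) gives the single octet 0x41 ('A'), while encode_utf8(0x20AC) (the euro sign) gives the three octets E2 82 AC.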
And finally, C and C++ offer wchar_t for wide characters, and it is often used for Unicode text. But there are some
pitfalls to note. So what is the size of a wchar_t? 2 or 4 bytes? The answer is yes: the language standards don't
specify an exact length. The Unicode 4.0
standard says that "ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but
requires that the characters from the portable C execution set correspond to their wide character equivalents by
zero extension." Furthermore, the standard specifies: "The width of wchar_t is compiler-specific and can be as small
as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for
storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be
Unicode characters in some compilers."
In practice, a UNIX-like operating system will usually use 4 bytes per wchar_t (it's best to verify this by using
sizeof()). If you use the Microsoft Windows API, you end up with 2 bytes per wchar_t.
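The sizeof() check can be sketched in C++11 as below. The helper names are hypothetical, and wchar_units_for rests on an assumption that I want to state loudly: it treats a 16-bit wchar_t as holding UTF-16 and a wider one as holding UTF-32, which matches the usual Windows and UNIX conventions but is not guaranteed by the C or C++ standards.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cwchar>

// Width of wchar_t in bits on the compiler that builds this code.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when one wchar_t can hold every code point up to U+10FFFF.
constexpr bool wchar_holds_any_code_point() {
    return WCHAR_MAX >= 0x10FFFF;
}

// Number of wchar_t units needed for one code point.
// ASSUMPTION: UTF-16 when wchar_t is 16 bits wide, UTF-32 when it is wider.
inline std::size_t wchar_units_for(char32_t cp) {
    assert(cp <= 0x10FFFF);  // anything larger is outside the Unicode code space
    return (wchar_bits() == 16 && cp > 0xFFFF) ? 2 : 1;
}
```

On a platform with a 2-byte wchar_t, a character outside the range U+0000 through U+FFFF therefore does not fit into a single wchar_t, which is one more reason not to assume "one wchar_t is one character" in portable code.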
Note: you must know the encoding method whenever you are handed a string.
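A small C++11 sketch shows why: the same two bytes decode to different text depending on which encoding you assume. Both functions are hypothetical helpers written for this post. ISO-8859-1 (Latin-1) is only my choice of a second encoding for contrast, and decode_utf8 assumes well-formed input and does no real validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Interpret every byte as one character (ISO-8859-1 / Latin-1).
std::u32string decode_latin1(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) {
        out += static_cast<char32_t>(b);
    }
    return out;
}

// Minimal UTF-8 decoder. ASSUMPTION: the input is well-formed UTF-8;
// the only check is an assert against a truncated sequence.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        // The lead byte tells how many continuation bytes follow.
        const std::size_t extra =
            lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        assert(i + extra < bytes.size());  // sequence must not be cut off
        // Keep only the payload bits of the lead byte.
        char32_t cp = (extra == 0) ? lead : (lead & (0x3F >> extra));
        for (std::size_t j = 1; j <= extra; ++j) {
            // Each continuation byte contributes its low 6 bits.
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + j]) & 0x3F);
        }
        out += cp;
        i += extra + 1;
    }
    return out;
}
```

Given the bytes C3 A9, decode_utf8 yields the single character U+00E9, while decode_latin1 yields the two characters U+00C3 and U+00A9. Nothing in the bytes themselves says which reading is the intended one.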