2007-08-27 15:01:51

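Character sets, character codes, and encodings: I had never really understood how they differ and how they relate. After running into trouble with them again and again at work recently, I finally made up my mind to get them straight.
Original article: ~jkorpela/chars.html
%7Ejkorpela/chars.html#control
Once you have understood it, you might try writing an automatic encoding detector for a browser. Mozilla's corresponding code is at: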

A tutorial on character code issues

  • The basics
  • Definitions: character repertoire, character code, character encoding
  • Examples of character codes
    • Good old ASCII
    • Another example: ISO Latin 1 alias ISO 8859-1
    • More examples: the Windows character set(s)
    • The ISO 8859 family
    • Other "extensions to ASCII"
    • Other "8-bit codes"
    • ISO 10646 (UCS) and Unicode
  • More about the character concept
    • The Unicode view
    • Control characters (control codes)
    • A glyph - a visual appearance
    • What's in a name?
    • Glyph variation
    • Fonts
    • Identity of characters: a matter of definition
    • Failures to display a character
    • Linear text vs. mathematical notations
    • Compatibility characters
    • Compositions and decompositions
  • Typing characters
    • Just pressing a key?
    • Program-specific methods for typing characters
    • "Escape" notations ("meta notations") for characters
    • How to mention (identify) a character
  • Information about encoding
    • The need for information about encoding
    • The MIME solution
    • An auxiliary encoding: Quoted-Printable (QP)
    • How MIME should work in practice
    • Problems with implementations - examples
  • Practical conclusions
  • Further reading

This document tries to clarify the concepts of character repertoire, character code, and character encoding, especially in the Internet context. It specifically avoids the term character set, which is confusingly used to denote repertoire or code or encoding. ASCII, ISO 646, ISO 8859 (ISO Latin, especially ISO Latin 1), the Windows character set, ISO 10646, Unicode, UTF-8, UTF-7, MIME, and QP are among the examples used. This document in itself does not contain solutions to practical problems with character codes (but see the section Further reading). Rather, it gives background information needed for understanding what solutions there might be, what the different solutions do - and what's really the problem in the first place.

If you are looking for some quick help in using a large character repertoire in HTML authoring, see the document .

Several technical terms related to character sets (e.g. glyph, encoding) can be difficult to understand, due to various confusions and due to having different names in different languages and contexts. The online database EuroDicAutom can be useful: it contains translations and definitions for several technical terms used here.

The numerical values are presented in the normal (decimal) notation here, but notice that other presentations are used too, especially octal (base 8) or hexadecimal (base 16) notation. Octets are often called bytes, but in principle, octet is a more definite concept than byte. Internally, octets consist of eight bits (hence the name, from Latin octo 'eight'), but we need not go into bit level here. However, you might need to know what the phrase "first bit set" or "sign bit set" means, since it is often used. In terms of numerical values of octets, it means that the value is greater than 127. In various contexts, such octets are sometimes interpreted as negative numbers, and this may cause various problems.
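To make the "first bit set" idea concrete, here is a minimal Python sketch. The function names are illustrative assumptions, not something defined in this tutorial. It tests whether the most significant bit of an octet is set and shows how the same octet reads as a negative number when it is treated as a signed 8-bit value.

```python
def high_bit_set(octet: int) -> bool:
    """Return True if the most significant of the eight bits is 1."""
    if not 0 <= octet <= 255:
        raise ValueError("an octet is an integer in the range 0..255")
    return (octet & 0x80) != 0  # equivalent to: octet > 127


def as_signed(octet: int) -> int:
    """Interpret the octet as a signed 8-bit (two's complement) number."""
    return octet - 256 if high_bit_set(octet) else octet


# Show each value in decimal, octal and hexadecimal notation.
for value in (65, 127, 128, 228):
    print(value, oct(value), hex(value), high_bit_set(value), as_signed(value))
```

A program that stores octets in a signed 8-bit type sees 228 as -28, which is one way the problems mentioned above arise.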

Different conventions can be established regarding how an octet or a sequence of octets presents some data. For instance, four consecutive octets often form a unit that presents a real number according to a specific standard. We are here interested in the presentation of character data (or string data; a string is a sequence of characters) only.

In the simplest case, which is still widely used, one octet corresponds to one character according to some mapping table (encoding). Naturally, this allows at most 256 different characters to be represented. There are several different encodings, such as the well-known ASCII encoding and the ISO Latin family of encodings. The correct interpretation and processing of character data of course requires knowledge about the encoding used. For HTML documents, such information should be sent by the Web server along with the document itself, using so-called HTTP headers (cf. MIME headers).

Previously the ASCII encoding was usually assumed by default (and it is still very common). Nowadays ISO Latin 1, which can be regarded as an extension of ASCII, is often the default. The current trend is to avoid giving such a special position to ISO Latin 1 among the variety of encodings.

The following definitions are not universally accepted and used. In fact, one of the greatest causes of confusion around character set issues is that terminology varies and is sometimes misleading.

Character repertoire
A set of distinct characters. No specific internal presentation in computers or data transfer is assumed. The repertoire per se does not even define an ordering for the characters; ordering for sorting and other purposes is to be specified separately. A character repertoire is usually defined by specifying names of characters and a sample (or reference) presentation of characters in visible form. Notice that a character repertoire may contain characters which look the same in some presentations but are regarded as logically distinct, such as Latin uppercase A, Cyrillic uppercase A, and Greek uppercase alpha. For more about this, see a discussion of the character concept later in this document.
Character code
A mapping, often presented in tabular form, which defines a one-to-one correspondence between characters in a character repertoire and a set of nonnegative integers. That is, it assigns a unique numerical code, a code position, to each character in the repertoire. In addition to being often presented as one or more tables, the code as a whole can be regarded as a single table and the code positions as indexes. As synonyms for "code position", the following terms are also in use: code number, code value, code element, code point, code set value - and just code. Note: The set of nonnegative integers corresponding to characters need not consist of consecutive numbers; in fact, most character codes have "holes", such as code positions reserved for control functions or for eventual future use to be defined later.
Character encoding
A method (algorithm) for presenting characters in digital form by mapping sequences of code numbers of characters into sequences of octets. In the simplest case, each character is mapped to an integer in the range 0 - 255 according to a character code and these are used as such as octets. Naturally, this only works for character repertoires with at most 256 characters. For larger sets, more complicated encodings are needed.

Notice that a character code assumes or implicitly defines a character repertoire. A character encoding could, in principle, be viewed purely as a method of mapping a sequence of integers to a sequence of octets. However, quite often an encoding is specified in terms of a character code (and the implied character repertoire). The logical structure is still the following:

  1. A character repertoire specifies a collection of characters, such as "a", "!", and "ä".
  2. A character code defines numeric codes for characters in a repertoire. For example, in the ISO 10646 character code the numeric codes for "a", "!", "ä", and "‰" (per mille sign) are 97, 33, 228, and 8240. (Note: Especially the per mille sign, presenting 0/00 as a single character, can be shown incorrectly on display or on paper. That would be an illustration of the symptoms of the problems we are discussing.)
  3. A character encoding defines how sequences of numeric codes are presented as (i.e., mapped to) sequences of octets. In one possible encoding for ISO 10646, the string a!ä‰ is presented as the following sequence of octets (using two octets for each character): 0, 97, 0, 33, 0, 228, 32, 48.
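The three levels can be traced in a few lines of Python. This is only a sketch: it assumes that Python's built-in `ord` gives the ISO 10646 code number and that the "utf-16-be" codec corresponds to the two-octets-per-character encoding used in the example above.

```python
# Repertoire level: four abstract characters (written with escapes so the
# example does not depend on the encoding of this file).
text = "a!\u00e4\u2030"  # a, exclamation mark, a with dieresis, per mille sign

# Code level: one nonnegative integer per character.
code_numbers = [ord(ch) for ch in text]

# Encoding level: the code numbers mapped to a sequence of octets,
# here two octets per character, most significant octet first.
octets = list(text.encode("utf-16-be"))

print(code_numbers)
print(octets)
```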

For a more rigorous explanation of these basic concepts, see .

The phrase character set is used in a variety of meanings. It might denote just a character repertoire but it may also refer to a character code, and quite often a particular character encoding is implied too.

Unfortunately the word charset is used to refer to an encoding, causing much confusion. It is even the official term to be used in several contexts by Internet protocols, in MIME headers.

Quite often the choice of a character repertoire, code, or encoding is presented as the choice of a language. For example, Web browsers typically confuse things quite a lot in this area. A pulldown menu in a program might be labeled "Languages", yet consist of character encoding choices (only). A language setting is quite distinct from character issues, although naturally each language has its own requirements on character repertoire. Even more seriously, programs and their documentation very often confuse the above-mentioned issues with the selection of a font.

The basics of ASCII

The name ASCII, originally an abbreviation for "American Standard Code for Information Interchange", denotes an old character repertoire, code, and encoding.

Most character codes currently in use contain ASCII as their subset in some sense. ASCII is the safest character repertoire to be used in data transfer. However, not even all ASCII characters are "safe".

ASCII has been used and is used so widely that often the word ASCII refers to "text" or "plain text" in general, even if the character code is something else! The words "ASCII file" quite often mean any text file as opposite to a binary file.

The definition of ASCII also specifies a set of control codes ("control characters") such as linefeed (LF) and escape (ESC). But the character repertoire proper, consisting of the printable characters of ASCII, is the following (where the first item is the blank, or space, character):

  ! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~

The appearance of characters varies, of course, especially for some special characters. Some of the variation and other details are explained in .

A formal view on ASCII

The character code defined by the ASCII standard is the following: code values are assigned to characters consecutively in the order in which the characters are listed above (rowwise), starting from 32 (assigned to the blank) and ending up with 126 (assigned to the tilde character ~). Positions 0 through 31 and 127 are reserved for control codes. They have standardized names, but in fact their usage varies a lot.

The character encoding specified by the ASCII standard is very simple, and the most obvious one for any character code where the code numbers do not exceed 255: each code number is presented as an octet with the same value.

Octets 128 - 255 are not used in ASCII. (This allows programs to use the first, most significant bit of an octet as a parity bit, for example.)
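As a small illustration of the formal view, the following Python sketch rebuilds the printable ASCII repertoire from the code positions 32 - 126 and checks that the encoding really is "one octet with the same value". The variable names are illustrative; Python's "ascii" codec is assumed to correspond to the US-ASCII encoding described here.

```python
# Code positions 32..126, in order, give the printable repertoire.
printable = "".join(chr(code) for code in range(32, 127))

# Print the repertoire rowwise, 16 code positions per row.
for start in range(32, 127, 16):
    print(" ".join(chr(c) for c in range(start, min(start + 16, 127))))

# The ASCII encoding presents each code number as an octet of the same value.
encoded = printable.encode("ascii")
same_values = all(octet == ord(ch) for octet, ch in zip(encoded, printable))

# A character outside the repertoire cannot be encoded at all.
try:
    "\u00e4".encode("ascii")  # a with dieresis, code number 228
    outside_rejected = False
except UnicodeEncodeError:
    outside_rejected = True
```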

There are several national variants of ASCII. In such variants, some special characters have been replaced by national letters (and other symbols). There is great variation here, and even within one country and for one language there might be different variants. The original ASCII is therefore often referred to as US-ASCII; the formal standard (by ANSI) is ANSI X3.4-1986.

The phrase "original ASCII" is perhaps not quite adequate, since the creation of ASCII started in late 1950s, and several additions and modifications were made in the 1960s. The had several unassigned code positions. The ANSI standard, where those positions were assigned, mainly to accommodate lower case letters, was approved in 1967/1968, later modified slightly. For the early history, including pre-ASCII character codes, see Steven J. Searle's and Tom Jennings' . See also 's , Mary Brandel's , and the computer history documents, including the background and creation of ASCII, written by , "father of ASCII".

ISO 646 defines a character set similar to ASCII but with the code positions corresponding to US-ASCII characters @[\]{|} as "national use positions". It also gives some liberties with characters #$^`~. The standard also defines an "international reference version (IRV)", which is (in the 1991 edition of ISO 646) identical to US-ASCII. ECMA has issued the ECMA-6 standard, which is equivalent to ISO 646 and is freely available on the Web.

Within the framework of ISO 646, and partly otherwise too, several "national variants of ASCII" have been defined, assigning different letters and symbols to the "national use" positions. Thus, the characters that appear in those positions - including those in US-ASCII - are somewhat "unsafe" in international data transfer, although this problem is losing significance. The trend is towards using the corresponding codes strictly for US-ASCII meanings; national characters are handled otherwise, giving them their own, unique and universal code positions in character codes larger than ASCII. But old software and devices may still reflect various "national variants of ASCII".

The following table lists ASCII characters which might be replaced by other characters in national variants of ASCII. (That is, the code positions of these US-ASCII characters might be occupied by other characters needed for national use.) The lists of characters appearing in national variants are not intended to be exhaustive, just typical examples.

dec  oct  hex  glyph  official name         national variants

 35   43  23   #      number sign           £ Ù
 36   44  24   $      dollar sign           ¤
 64  100  40   @      commercial at         É § Ä à ³
 91  133  5B   [      left square bracket   Ä Æ ° â ¡ ÿ é
 92  134  5C   \      reverse solidus       Ö Ø ç Ñ ½ ¥
 93  135  5D   ]      right square bracket  Å Ü § ê é ¿ |
 94  136  5E   ^      circumflex accent     Ü î
 95  137  5F   _      low line              è
 96  140  60   `      grave accent          é ä µ ô ù
123  173  7B   {      left curly bracket    ä æ é à ° ¨
124  174  7C   |      vertical line         ö ø ù ò ñ f
125  175  7D   }      right curly bracket   å ü è ç ¼
126  176  7E   ~      tilde                 ü ¯ ß ¨ û ì ´ _

Almost all of the characters used in the national variants have been incorporated into ISO Latin 1. Systems that support ISO Latin 1 in principle may still reflect the use of national variants of ASCII in some details; for example, an ASCII character might get printed or displayed according to some national variant. Thus, even "plain ASCII text" is not always portable from one system or application to another.


Mainly due to the national variants discussed above, some characters are less "safe" than others, i.e. more often transferred or interpreted incorrectly.

In addition to the letters of the English alphabet ("A" to "Z", and "a" to "z"), the digits ("0" to "9") and the space (" "), only the following characters can be regarded as really "safe" in data transmission:

! " % & ' ( ) * + , - . / : ; < = > ?

Even these characters might eventually be interpreted wrongly by the recipient, e.g. by a human reader seeing a glyph for "&" as something other than what it is intended to denote, or by a program interpreting "<" as starting some special markup, "?" as being a so-called wildcard character, etc.

When you need to name things (e.g. files, variables, data fields, etc.), it is often best to use only the characters listed above, even if a wider character repertoire is possible. Naturally you need to take into account any additional restrictions imposed by the applicable syntax. For example, the rules of a programming language might restrict the character repertoire in identifier names to letters, digits and one or two other characters.
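The naming advice can be turned into a simple check. The sketch below is an assumption-laden illustration: the names `SAFE_CHARACTERS` and `unsafe_characters` are made up for this example, and the set is built directly from the list of "safe" characters given above (English letters, digits, space, and the nineteen listed punctuation characters).

```python
import string

# The punctuation characters listed above as really "safe".
SAFE_PUNCTUATION = "!\"%&'()*+,-./:;<=>?"

SAFE_CHARACTERS = set(
    string.ascii_letters + string.digits + " " + SAFE_PUNCTUATION
)


def unsafe_characters(name: str) -> list:
    """Return the characters in name that fall outside the safe set."""
    return sorted(set(name) - SAFE_CHARACTERS)


for candidate in ("report-2007.txt", "report_2007.txt", "notes#1"):
    print(candidate, unsafe_characters(candidate))
```

Note that the low line "_" and the number sign "#" are reported, since both occupy positions that national variants of ASCII may use differently.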

Sometimes the phrase "8-bit ASCII" is used. It follows from the discussion above that in reality ASCII is strictly and unambiguously a 7-bit code in the sense that all code positions are in the range 0 - 127.

The phrase "8-bit ASCII" is a misnomer used to refer to various character codes which are extensions of ASCII in the following sense: the character repertoire contains ASCII as a subset, the code numbers are in the range 0 - 255, and the code numbers of ASCII characters equal their ASCII codes.

The ISO 8859-1 standard (which is part of the ISO 8859 family of standards) defines a character repertoire identified as "Latin alphabet No. 1", commonly called "ISO Latin 1", as well as a character code for it. The repertoire contains the ASCII repertoire as a subset, and the code numbers for those characters are the same as in ASCII. The standard also specifies an encoding, which is similar to that of ASCII: each code number is presented simply as one octet.

In addition to the ASCII characters, ISO Latin 1 contains various accented characters and other letters needed for writing languages of Western Europe, and some special characters. These characters occupy code positions 160 - 255, and they are:

  ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯
° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Notes:

  • The first of the characters above appears as space; it is the so-called no-break space.
  • The presentation of some characters in copies of this document may be defective e.g. due to lack of support. You may wish to compare the presentation of the characters on your browser with the character table presented as a GIF image in the famous document. (In text only mode, you may wish to use my simple which contains the names of the characters.)
  • Naturally, the appearance of characters varies from one font to another.

See also: , which presents detailed characterizations of the meanings of the characters and comments on their usage in various contexts.

In ISO 8859-1, code positions 128 - 159 are explicitly reserved for control purposes; they "correspond to bit combinations that do not represent graphic characters". The so-called Windows character set (WinLatin1, or Windows code page 1252, to be exact) uses some of those positions for printable characters. Thus, the Windows character set is not identical with ISO 8859-1. It is, however, true that the Windows character set is much more similar to ISO 8859-1 than the so-called DOS character sets are. The Windows character set is often called "ANSI character set", but this is seriously misleading. It has not been approved by ANSI. (Historical background: Microsoft based the design of the set on a draft for an ANSI standard. A glossary by Microsoft explicitly admits this.)

Note that programs used on Windows systems may use a DOS character set; for example, if you create a text file using a Windows program and then use the type command on DOS prompt to see its content, strange things may happen, since the DOS command interprets the data according to a DOS character code.

In the Windows character set, some positions in the range 128 - 159 are assigned to printable characters, such as "smart quotes", em dash, en dash, and trademark symbol. Thus, the character repertoire is larger than ISO Latin 1. The use of octets in the range 128 - 159 in any data to be processed by a program that expects ISO 8859-1 encoded data is an error which might cause just anything. They might for example get ignored, or be processed in a manner which looks meaningful, or be interpreted as control characters. See my document for a discussion of the problems of using these characters.
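The difference is easy to observe by decoding the same octets twice. The following Python sketch assumes that Python's "cp1252" codec corresponds to the Windows character set discussed here and that "iso8859-1" maps octets 128 - 159 to control characters; the sample octets are chosen for illustration only.

```python
import unicodedata

# Octets 147, 148, 151 and 153 are in the range 128 - 159.
data = bytes([0x93, 0x48, 0x69, 0x94, 0x97, 0x99])

as_windows = data.decode("cp1252")     # printable: quotes, em dash, trademark
as_latin1 = data.decode("iso8859-1")   # the same octets become control codes

for ch in as_windows:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Unicode general category "Cc" means "control character".
control_positions = [
    i for i, ch in enumerate(as_latin1) if unicodedata.category(ch) == "Cc"
]
print(control_positions)
```

A program expecting ISO 8859-1 therefore receives four control characters where the author intended punctuation.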

The Windows character set exists in different variations, or "code pages" (CP), each of which generally differs from the corresponding ISO 8859 standard in that it contains the same characters in positions 128 - 159 as code page 1252. (However, there are some further differences.) What we have discussed here is the most usual one, resembling ISO 8859-1. Its status in the official registry was unclear; an encoding had been registered under the name ISO-8859-1-Windows-3.1-Latin-1 by Hewlett-Packard (!), presumably intending to refer to WinLatin1, but in 1999-12 it was registered under the name windows-1252. That name has in fact been widely used for it. (The name cp-1252 has been used too, but it isn't officially registered even as an alias name.)

There are several character codes which are extensions to ASCII in the same sense as ISO 8859-1 and the Windows character set.

ISO 8859-1 itself is just one member of the ISO 8859 family of character codes, which is nicely overviewed in a famous document by Roman Czyborra. The ISO 8859 codes extend the ASCII repertoire in different ways with different special characters (used in different languages and cultures). Just as ISO 8859-1 contains ASCII characters and a collection of characters needed in languages of western (and northern) Europe, there is ISO 8859-2 alias ISO Latin 2 constructed similarly for languages of central/eastern Europe, etc. The ISO 8859 character codes are isomorphic in the following sense: code positions 0 - 127 contain the same characters as in ASCII, positions 128 - 159 are unused (reserved for control characters), and positions 160 - 255 are the varying part, used differently in different members of the ISO 8859 family.

The ISO 8859 character codes are normally presented using the obvious encoding: each code position is presented as one octet. Such encodings have several alternative names in the official registry, but the preferred ones are of the form ISO-8859-n.

Although ISO 8859-1 has been a de facto default encoding in many contexts, it has in principle no special role. ISO 8859-15 was expected to replace ISO 8859-1 to a great extent, since it contains the politically important symbol for the euro, but it seems to have little practical use.
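The "same lower half, varying upper half" structure of the family can be checked with Python's built-in codecs. This is a sketch under the assumption that the codecs named "iso8859-1", "iso8859-2" and "iso8859-15" implement the corresponding parts of the standard; `decode_octet` is a helper invented for this example.

```python
def decode_octet(value: int, codec: str) -> str:
    """Interpret a single octet according to the given character code."""
    return bytes([value]).decode(codec)


FAMILY = ("iso8859-1", "iso8859-2", "iso8859-15")

# Positions 0 - 127 mean the same in every member of the family.
lower_half = bytes(range(128))
lower_halves_agree = all(
    lower_half.decode(codec) == lower_half.decode("ascii") for codec in FAMILY
)

# Positions 160 - 255 are the varying part.
for value in (0xA4, 0xB1, 0xE4):
    print(value, [decode_octet(value, codec) for codec in FAMILY])
```

Position 164 (hex A4) is the general currency sign in ISO 8859-1 and ISO 8859-2 but the euro sign in ISO 8859-15, while position 228 is "ä" in all three.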

The following table lists the ISO 8859 alphabets, with links to more detailed descriptions. There is a separate document which you might use to determine which (if any) of the alphabets are suitable for a document in a given language or combination of languages. My contains a combined character table, too.

The parts of ISO 8859

standard     name of alphabet         characterization
ISO 8859-1   Latin alphabet No. 1     "Western", "West European"
ISO 8859-2   Latin alphabet No. 2     "Central European", "East European"
ISO 8859-3   Latin alphabet No. 3     "South European"; "Maltese & Esperanto"
ISO 8859-4   Latin alphabet No. 4     "North European"
ISO 8859-5   Latin/Cyrillic alphabet  (for Slavic languages)
ISO 8859-6   Latin/Arabic alphabet    (for the Arabic language)
ISO 8859-7   Latin/Greek alphabet     (for modern Greek)
ISO 8859-8   Latin/Hebrew alphabet    (for Hebrew and Yiddish)
ISO 8859-9   Latin alphabet No. 5     "Turkish"
ISO 8859-10  Latin alphabet No. 6     "Nordic" (Sámi, Inuit, Icelandic)
ISO 8859-11  Latin/Thai alphabet      (for the Thai language)
ISO 8859-13  Latin alphabet No. 7     Baltic Rim
ISO 8859-14  Latin alphabet No. 8     Celtic
ISO 8859-15  Latin alphabet No. 9     "euro"
ISO 8859-16  Latin alphabet No. 10    for South-Eastern Europe (see below)

Notes: ISO 8859-n is Latin alphabet No. n only for n = 1, 2, 3, 4; for the other Latin alphabets the numbers do not match. ISO 8859-16 is for use in Albanian, Croatian, English, Finnish, French, German, Hungarian, Irish Gaelic (new orthography), Italian, Latin, Polish, Romanian, and Slovenian. In particular, it contains letters s and t with comma below, in order to address an issue in writing Romanian. See the site for the current status and proposed changes to the ISO 8859 set of standards.

In addition to the codes discussed above, there are other extensions to ASCII which utilize the code range 0 - 255, such as

DOS character codes, or "code pages" (CP)
In MS-DOS systems, different character codes are used; they are called "code pages". The original American code page was CP 437, which has e.g. some Greek letters, mathematical symbols, and characters which can be used as elements in simple pseudo-graphics. Later CP 850 became popular, since it contains letters needed for West European languages - largely the same letters as ISO 8859-1, but in different code positions. Note that DOS code pages are quite different from Windows character codes, though the latter are sometimes called with names like cp-1252 (= windows-1252)! For further confusion, Microsoft now prefers to use the notion "OEM code page" for the DOS character set used in a particular country.
Macintosh character code
On the Macintosh, the character code is more uniform than on PCs (although there are some national variants). The Mac character repertoire is a mixed combination of ASCII, accented letters, mathematical symbols, and other ingredients. See section Text in Mac OS 8 and 9 Developer Documentation.

Notice that many of these are very different from ISO 8859-1. They may have different character repertoires, and the same character often has different code values in different codes. For example, code position 228 is occupied by ä (letter a with dieresis, or umlaut) in ISO 8859-1, by ð (Icelandic letter eth) in HP's Roman-8 character set, by õ (letter o with tilde) in DOS code page 850, and by the per mille sign (‰) in the Macintosh character code.
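The code position 228 example can be reproduced with a short Python sketch. It assumes that Python's "iso8859-1", "hp_roman8", "cp850" and "mac_roman" codecs correspond to the four character codes named above; `meanings_of` is a name made up for this illustration.

```python
import unicodedata

CODECS = ["iso8859-1", "hp_roman8", "cp850", "mac_roman"]


def meanings_of(octet: int) -> dict:
    """Map each codec name to the character it assigns to the octet."""
    return {codec: bytes([octet]).decode(codec) for codec in CODECS}


# One octet value, several different characters.
for codec, ch in meanings_of(228).items():
    print(f"{codec:10} U+{ord(ch):04X} {unicodedata.name(ch)}")
```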

For information about several code pages, see by Roman Czyborra. See also his excellent , such as different variants of KOI-8; most of them are extensions to ASCII, too.

In general, full conversions between the character codes mentioned above are not possible. For example, the Macintosh character repertoire contains the Greek letter pi, which does not exist in ISO Latin 1 at all. Naturally, a text can be converted (by a simple program which uses a conversion table) from Macintosh character code to ISO 8859-1 if the text contains only those characters which belong to the ISO Latin 1 character repertoire. Text presented in the Windows character code can be used as such as ISO 8859-1 encoded data if it contains only those characters which belong to the ISO Latin 1 character repertoire.
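A conversion of this kind might look as follows in Python. This is a sketch, not a definitive converter: it relies on the conversion tables built into Python's "mac_roman" and "iso8859-1" codecs, and the function name `mac_to_latin1` is hypothetical.

```python
def mac_to_latin1(octets: bytes) -> bytes:
    """Convert Macintosh-coded text to ISO 8859-1.

    Raises UnicodeEncodeError if the text contains a character that does
    not belong to the ISO Latin 1 repertoire.
    """
    text = octets.decode("mac_roman")   # octets -> characters
    return text.encode("iso8859-1")     # characters -> octets


# "ä" is octet 138 in the Macintosh code and octet 228 in ISO 8859-1.
converted = mac_to_latin1(bytes([138]))

# Octet 185 is the Greek letter pi in the Macintosh code; it has no
# counterpart in ISO Latin 1, so the conversion has to fail.
try:
    mac_to_latin1(bytes([185]))
    pi_converted = True
except UnicodeEncodeError:
    pi_converted = False

print(list(converted), pi_converted)
```

A real converter would have to decide what to do on failure, for example reject the text or substitute a replacement character; that policy is outside what the text above specifies.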

All the character codes discussed above are "8-bit codes": eight bits are sufficient for presenting the code numbers, and in practice the encoding (at least the normal encoding) is the obvious (trivial) one where each code position (thereby, each character) is presented as one octet (byte). This means that there are 256 code positions, but several positions are reserved for control codes or left unused (unassigned, undefined).

Although currently most "8-bit codes" are extensions to ASCII in the sense described above, this is just a practical matter caused by the widespread use of ASCII. It was practical to make the "lower halves" of the character codes the same, for several reasons.

ISO 2022 and ISO 4873 define a general framework for 8-bit codes (and 7-bit codes) and for switching between them. One of the basic ideas is that code positions 128 - 159 (decimal) are reserved for use as control codes ("C1 controls"). Note that the Windows character sets do not comply with this principle.

A different kind of 8-bit code is EBCDIC, defined by IBM and once in widespread use on "mainframes" (and still in use). EBCDIC contains all ASCII characters but in quite different code positions. As an interesting detail, in EBCDIC normal letters A - Z do not all appear in consecutive code positions. EBCDIC exists in different national variants (cf. the national variants of ASCII).
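The gaps in the EBCDIC alphabet can be located with a few lines of Python. The sketch assumes Python's "cp037" codec, which is only one of the EBCDIC variants mentioned above; other variants may place some characters differently.

```python
import string

letters = string.ascii_uppercase

# Code position of each letter A - Z in EBCDIC code page 037.
positions = {ch: ch.encode("cp037")[0] for ch in letters}

# Pairs of adjacent letters whose code positions are not adjacent.
gaps = [
    (a, b)
    for a, b in zip(letters, letters[1:])
    if positions[b] - positions[a] != 1
]

print({ch: hex(positions[ch]) for ch in "AIJRSZ"})
print(gaps)
```

Code that tests for a letter with a range comparison such as "A" <= c <= "Z" on raw octet values therefore also accepts non-letters when the data is in EBCDIC.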

ISO 10646, the standard

ISO 10646 (officially: ISO/IEC 10646) is an international standard, by ISO and IEC. It defines UCS, Universal Character Set, which is a very large and growing character repertoire, and a character code for it. Currently tens of thousands of characters have been defined, and new amendments are defined fairly often. It contains, among other things, all characters in the character repertoires discussed above.

The number of the standard intentionally reminds us of 646, the number of the ISO standard corresponding to ASCII.

Unicode is a standard, by the Unicode Consortium, which defines a character repertoire and character code intended to be fully compatible with ISO 10646, and an encoding for it. ISO 10646 is more general (abstract) in nature, whereas Unicode "imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications", as the Consortium itself says.

Unicode was originally designed to be a 16-bit code, but it was extended so that currently code positions are expressed as integers in the hexadecimal range 0..10FFFF (decimal 0..1 114 111). That space is divided into 16-bit "planes". Until recently, the use of Unicode has mostly been limited to "Basic Multilingual Plane (BMP)" consisting of the range 0..FFFF.
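The division of the code space into planes is plain arithmetic, as the following Python sketch shows. The function names are illustrative; `sys.maxunicode` is the largest code position that Python itself accepts.

```python
import sys

MAX_CODE_POSITION = 0x10FFFF  # decimal 1 114 111


def plane(code_position: int) -> int:
    """Return the number of the 16-bit plane containing the code position."""
    if not 0 <= code_position <= MAX_CODE_POSITION:
        raise ValueError("outside the Unicode code space")
    return code_position >> 16  # drop the 16 bits that index within a plane


def in_bmp(code_position: int) -> bool:
    """True for the Basic Multilingual Plane, the range 0..FFFF."""
    return plane(code_position) == 0


number_of_planes = plane(MAX_CODE_POSITION) + 1

for cp in (0x41, 0x2030, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X}", plane(cp), in_bmp(cp))
```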

The ISO 10646 and Unicode character repertoire can be regarded as a superset of most character repertoires in use. However, the code positions of characters vary from one character code to another.

In practice, people usually talk about Unicode rather than ISO 10646, partly because we prefer names to numbers, partly because Unicode is more explicit about the meanings of characters, partly because detailed information about Unicode is available on the Web (see below).

Unicode version 1.0 used somewhat different names for some characters than ISO 10646. In Unicode version 2.0, the names were made the same as in ISO 10646. New versions of Unicode are expected mostly to add new characters. Version 3.0, with a total number of 49,194 characters (38,887 in version 2.1), was published in February 2000, and version 4.0 has 96,248 characters.

Until recently, the ISO 10646 standard had not been put onto the Web. It is now available as a large (80 megabytes) zipped PDF file via the page of ISO/IEC JTC1. page. It is available in printed form from . But for most practical purposes, the same information is in the Unicode standard.

For more information, see

  • by the Unicode Consortium. It is fairly large but divided into sections rather logically, except that section would be better labeled as "Miscellaneous".
  • Roman Czyborra's material on Unicode, such as and
  • Olle Järnefors: . Very readable and informative, though somewhat outdated e.g. as regards to . (It also contains a more detailed technical description of the UTF encodings than those given above.)
  • : . Contains helpful general explanations as well as practical implementation considerations.
  • Steven J. Searle: . Contains a valuable historical review, including critical notes on the "unification" of Chinese, Japanese and Korean (CJK) characters.
  • : ; some software tools for actually writing Unicode; I'd especially recommend taking a look at the free editor (for Windows).
There are also some books on Unicode:
  • Jukka K. Korpela: . O’Reilly, 2006.
  • Tony Graham: . Wiley, 2000.
  • Richard Gillam: . Addison-Wesley, 2002.

  • : the standard itself, mostly in PDF format; it's partly hard to read, so you might benefit from my , which briefly explains the structure of the standard and how to find information about a particular character there
  • , the Unicode standard in French
  • , containing , , and representative for the characters and notes on their usage. Available in PDF format, containing the same information as in the corresponding parts of the printed standard. (The charts were previously available in faster-access format too, as HTML documents containing the charts as GIF images. But this version seems to have been removed.)
  • , a large (over 460 000 ) plain text file listing Unicode character , , and defined character in a compact
  • to ISO 10646-1:1993 (i.e., old version!), which lists, in alphabetic order, all character (and the ) except Hangul and CJK ideographs; useful for finding out the code position when you know the (right!) name of a character.
  • by Indrek Hein at the . You can e.g. search for characters by name or code position and get the Unicode equivalents of characters in many widely used character sets.
  • ; contains some additional information on how to find a Unicode number for a character
    Originally, before extending the code range past 16 bits, the "native" Unicode encoding was UCS-2, which presents each code number as two consecutive octets m and n so that the number equals 256m+n. This means, to express it in computer jargon, that the code number is presented as a two-byte integer. According to the Unicode Consortium, the term UCS-2 should now be avoided, as it is associated with the 16-bit limitations.

    UTF-32 encodes each code position as a 32-bit binary integer, i.e. as four octets. This is a very obvious and simple encoding. However, it is inefficient in terms of the number of octets needed. If we have normal English text or other text which contains ISO Latin 1 characters only, the length of the Unicode encoded octet sequence is four times the length of the string in ISO 8859-1 encoding. UTF-32 is rarely used, except perhaps in internal operations (since it is very simple for the purposes of string processing).

    UTF-16 represents each code position in the Basic Multilingual Plane as two octets. Other code positions are presented using so-called surrogate pairs, utilizing some code positions in the BMP reserved for the purpose. This, too, is a very simple encoding when the data contains BMP characters only.
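The three fixed-width or nearly fixed-width encodings can be sketched by hand in Python. The surrogate arithmetic below (subtract hexadecimal 10000, split the remaining 20 bits into two halves, add them to D800 and DC00) is the standard UTF-16 rule; it is not spelled out in this tutorial, so treat it and the function names as additions made for illustration. Octet order is assumed to be most significant first.

```python
def ucs2_octets(code: int) -> bytes:
    """Two octets m and n such that code == 256*m + n (BMP only)."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("UCS-2 cannot present code positions above FFFF")
    m, n = divmod(code, 256)
    return bytes([m, n])


def utf16_units(code: int) -> list:
    """One 16-bit unit for BMP positions, a surrogate pair otherwise."""
    if code <= 0xFFFF:
        return [code]
    offset = code - 0x10000                # a 20-bit number
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return [high, low]


def utf32_octets(code: int) -> bytes:
    """The code position as a 32-bit integer, i.e. four octets."""
    return code.to_bytes(4, "big")


print(list(ucs2_octets(8240)))                 # per mille sign
print([hex(u) for u in utf16_units(0x1D11E)])  # a position outside the BMP
print(list(utf32_octets(97)))                  # letter a
```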

    Unicode can be, and often is, encoded in other ways, too, such as the following encodings:

    UTF-8
    Character codes less than 128 (effectively, the ASCII repertoire) are presented "as such", using one octet for each code (character). All other codes are presented, according to a relatively complicated method, so that one code (character) is presented as a sequence of two to four octets, each of which is in the range 128 - 255. This means that in a sequence of octets, octets in the range 0 - 127 ("bytes with most significant bit set to 0") directly represent characters, whereas octets in the range 128 - 255 ("bytes with most significant bit set to 1") are to be interpreted as really encoded presentations of characters.
    UTF-7
    Each character code is presented as a sequence of one or more octets in the range 0 - 127 ("bytes with most significant bit set to 0", or "seven-bit bytes", hence the name). Most characters are presented as such, each as one octet, but for obvious reasons some octet values must be reserved for use as "escape" octets, specifying that the octet together with a certain number of subsequent octets forms a multi-octet encoded presentation of one character. There is an example later in this document.
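
    The two encodings described above (the one using octets 128 - 255 for multi-octet sequences, i.e. UTF-8, and UTF-7) can be explored with a short Python sketch. The function utf8_octets is written here only to make the "relatively complicated method" concrete; in practice one would simply use the built-in "utf-8" codec, which the sketch uses as a cross-check. The UTF-7 part relies on Python's built-in "utf-7" codec.

```python
def utf8_octets(code_position):
    """Hand-written UTF-8 encoder for one code position (illustration only)."""
    if 0xD800 <= code_position <= 0xDFFF or not 0 <= code_position <= 0x10FFFF:
        raise ValueError("not an encodable code position")
    if code_position < 0x80:                      # presented "as such"
        return bytes([code_position])
    if code_position < 0x800:                     # two octets
        return bytes([0xC0 | (code_position >> 6),
                      0x80 | (code_position & 0x3F)])
    if code_position < 0x10000:                   # three octets
        return bytes([0xE0 | (code_position >> 12),
                      0x80 | ((code_position >> 6) & 0x3F),
                      0x80 | (code_position & 0x3F)])
    return bytes([0xF0 | (code_position >> 18),   # four octets
                  0x80 | ((code_position >> 12) & 0x3F),
                  0x80 | ((code_position >> 6) & 0x3F),
                  0x80 | (code_position & 0x3F)])


for ch in ["A", "\u00e9", "\u20ac", "\U0001D11E"]:
    octets = utf8_octets(ord(ch))
    assert octets == ch.encode("utf-8")           # agrees with the built-in codec
    if len(octets) > 1:
        assert all(o >= 128 for o in octets)      # multi-octet: all in 128 - 255

# UTF-7: every octet stays in the range 0 - 127, even for non-ASCII text.
sample = "caf\u00e9"
utf7 = sample.encode("utf-7")
assert all(o < 128 for o in utf7)
assert utf7.decode("utf-7") == sample
```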

    IETF Policy on Character Sets and Languages clearly favors UTF-8. It requires support for it in Internet protocols (and doesn't even mention UTF-7). Note that UTF-8 is efficient if the data consists predominantly of ASCII characters with just a few "special characters" in addition to them, and reasonably efficient for predominantly ISO Latin 1 text.
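
    To get a feeling for this efficiency claim, one can compare octet counts. The following is a rough sketch with a made-up helper name; it measures UTF-8 size relative to ISO 8859-1 size and therefore works only for text within the ISO Latin 1 repertoire.

```python
def utf8_overhead(text):
    """Ratio of UTF-8 length to ISO 8859-1 length, in octets
    (hypothetical helper; text must be encodable in ISO 8859-1)."""
    return len(text.encode("utf-8")) / len(text.encode("iso-8859-1"))


pure_ascii = "plain text"
mostly_ascii = "The na\u00efve approach"        # one non-ASCII character
accent_heavy = "\u00e9t\u00e9 \u00e0 \u00c5bo"  # several non-ASCII characters

assert utf8_overhead(pure_ascii) == 1.0          # no cost at all for ASCII
assert 1.0 < utf8_overhead(mostly_ascii) < utf8_overhead(accent_heavy) <= 2.0
```

    The ratio can never exceed 2 for ISO Latin 1 text, since each such character takes at most two octets in UTF-8.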

    The implementation of Unicode support is a long and mostly gradual process. Unicode can be supported by programs on any operating system, although some systems may allow much easier implementation than others; this mainly depends on whether the system uses Unicode internally so that support for Unicode is "built-in".

    Even in circumstances where Unicode is supported in principle, the support usually does not cover all Unicode characters. For example, an available font may cover just some part of Unicode which is practically important in some area. On the other hand, for data transfer it is essential to know which Unicode characters the recipient is able to handle. For such reasons, various subsets of the Unicode character repertoire have been and will be defined. For example, the Minimum European Subset was intended to provide a first step towards the implementation of large character sets in Europe. It was replaced by three subsets (MES-1, MES-2, MES-3, with MES-2 based on the Minimum European Subset), defined in a CEN Workshop Agreement, namely CWA 13873.

    A practically important one is Microsoft's , or "PanEuropean" character set, characterized on Microsoft's page and excellently listed on page by .

    Unicode characters are often referred to using a notation of the form U+nnnn where nnnn is a four-digit notation of the code value. For example, U+0020 means the space character (with code value 20 in hexadecimal, 32 in decimal). Notice that such notations identify a character through its Unicode code value, without referring to any particular encoding. There are other ways to identify a character, too.
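
    Converting between a character and its U+nnnn notation is straightforward. The two helper names below are invented for this sketch. Note that code values beyond 16 bits need more than four hexadecimal digits, so the sketch pads to at least four digits rather than exactly four.

```python
def u_notation(character):
    """Return the U+nnnn notation for a single character (illustrative helper)."""
    return "U+%04X" % ord(character)


def from_u_notation(notation):
    """Return the character identified by a U+nnnn notation (illustrative helper)."""
    if not notation.upper().startswith("U+"):
        raise ValueError("expected a notation of the form U+nnnn")
    return chr(int(notation[2:], 16))


assert u_notation(" ") == "U+0020"         # the space character, 32 in decimal
assert from_u_notation("U+0020") == " "
assert ord(from_u_notation("U+0020")) == 32
```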

    An "A" (or any other character) is something like a Platonic entity: it is the idea of an "A" and not the "A" itself.
    -- Michael E. Cohen: .

    The character concept is very fundamental for the issues discussed here but difficult to define exactly. The more fundamental concepts we use, the harder it is to give good definitions. (How would you define "life"? Or "structure"?) Here we will concentrate on clarifying the character concept by indicating what it does not imply.

    The Unicode standard describes characters as "the smallest components of written language that have semantic value", which is somewhat misleading. A character such as a letter can hardly be described as having a meaning (semantic value) in itself. Moreover, a character such as ú (letter u with acute accent), which belongs to Unicode, can often be regarded as consisting of smaller components: a letter and a diacritic. And in fact the very definition of the character concept in Unicode is the following:

    abstract character: a unit of information used for the organization, control, or representation of textual data.

    (In Unicode terminology, "abstract character" is a character as an element of a character repertoire, whereas "character" refers to "coded character representation", which effectively means a code value. It would be natural to assume that the opposite of an abstract character is a concrete character, as something that actually appears in some physical form on paper or screen; but oh no, the Unicode concept "character" is more concrete than an "abstract character" only in the sense that it has a fixed code position! An actual physical form of an abstract character, with a specific shape and size, is a glyph. Confusing, isn't it?)
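
    The remark above that letter u with acute accent can be regarded as a letter plus a diacritic can be made concrete with Python's standard unicodedata module. This is just an illustration using the canonical decomposition (normalization form NFD), a mechanism that the discussion above does not itself rely on.

```python
import unicodedata

u_acute = "\u00fa"  # letter u with acute accent, as one coded character

# Canonical decomposition splits it into a base letter and a combining mark.
decomposed = unicodedata.normalize("NFD", u_acute)
parts = [("U+%04X" % ord(c), unicodedata.name(c)) for c in decomposed]

assert parts == [
    ("U+0075", "LATIN SMALL LETTER U"),
    ("U+0301", "COMBINING ACUTE ACCENT"),
]
```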

    The rôle of the so-called control characters in character codes is somewhat obscure. Character codes often contain code positions which are not assigned to any visible character but reserved for control purposes. For example, in communication between a terminal and a computer using the ASCII code, the computer could regard octet 3 as a request for terminating the currently running process. Some older character code standards contain explicit descriptions of such conventions whereas newer standards just reserve some positions for such usage, to be defined elsewhere. And although the definition quoted above suggests that "control characters" might be regarded as characters in the Unicode terminology, perhaps it is more natural to regard them as control codes.

    Control codes can be used for device control such as cursor movement, page eject, or changing colors. Quite often they are used in combination with codes for graphic characters, so that a device is expected to interpret the combination as a specific command and not display the graphic character(s) contained in it. For example, in the classical , ESC followed by the code corresponding to the letter "A" or something more complicated (depending on mode settings) moves the cursor up. To take a different example, the editor treats ESC A as a request to move to the beginning of a sentence. Note that the ESC control code is logically distinct from the ESC key in a keyboard, and many other things than pressing ESC might cause the ESC control code to be sent. Also note that phrases like are often used to refer to things that don't involve ESC at all and operate at a quite different level. , the inventor of ESC, has written a "vignette" about it: .
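
    As data, such a command is nothing more than a short sequence of octets that begins with the ESC control code (code position 27). The sketch below assumes two conventions that are not spelled out above: the short form ESC A, and the longer form ESC [ A used by ANSI-style terminals for the same "cursor up" function. The constant and function names are made up for the example.

```python
ESC = "\x1b"  # the ESC control code, code position 27 (not the ESC key)

CURSOR_UP_SHORT = ESC + "A"    # assumed short form: ESC followed by "A"
CURSOR_UP_ANSI = ESC + "[A"    # assumed ANSI-style form: ESC [ A


def octets(sequence):
    """Return the octet values that would be sent to the device."""
    return list(sequence.encode("ascii"))


assert octets(CURSOR_UP_SHORT) == [27, 65]
assert octets(CURSOR_UP_ANSI) == [27, 91, 65]
```

    Whether a device treats the "A" as part of a command or displays it as a letter depends entirely on the preceding ESC and on the device's conventions.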

    One possible form of device control is changing the way a device interprets the data (octets) that it receives. For example, a control code followed by some data in a specific format might be interpreted so that any subsequent octets are to be interpreted according to a table identified in some specific way. This is often called "code page switching", and it means that control codes could be used to change the character encoding. And it is then more logical to consider the control codes and associated data at the level of fundamental interpretation of data rather than direct device control. An international standard defines powerful facilities for using different 8-bit character codes in a document.

    Widely used formatting control codes include carriage return (CR), linefeed (LF), and horizontal tab (HT), which in ASCII occupy code positions 13, 10, and 9. The names (or abbreviations) suggest generic meanings, but the actual meanings are defined partly in each character code definition, partly - and more importantly - by various other conventions "above" the character level. The "formatting" codes might be seen as a special case of device control, in a sense, but more naturally, a CR or a LF or a CR LF pair (to mention the most common conventions) when used in a text file simply indicates a new line. As regards control codes used for line structuring, see Unicode technical report #13. The horizontal tab is often used for real "tabbing" to some predefined writing position. But it is also used e.g. for indicating data boundaries, without any particular presentational effect, for example in the widely used "tab separated values" (TSV) data format.
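
    Both uses can be illustrated briefly. The first function below treats CR, LF and the CR LF pair alike as "new line"; the second part reads tab separated values with Python's standard csv module. The function name and the sample data are assumptions made for the example.

```python
import csv
import io

CR, LF, HT = 13, 10, 9  # code positions of the three control codes
assert (ord("\r"), ord("\n"), ord("\t")) == (CR, LF, HT)


def split_lines(text):
    """Split text into lines, accepting CR LF, CR, or LF as the line break."""
    return text.replace("\r\n", "\n").replace("\r", "\n").split("\n")


assert split_lines("a\r\nb\rc\nd") == ["a", "b", "c", "d"]

# HT as a pure data boundary: tab separated values.
tsv_data = "name\tcode\nCR\t13\nLF\t10\n"
rows = list(csv.reader(io.StringIO(tsv_data), delimiter="\t"))
assert rows == [["name", "code"], ["CR", "13"], ["LF", "10"]]
```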

    A control code, or a "control character", cannot have a graphic presentation (a glyph) in the same way as normal characters have. However, in Unicode there is a separate block which contains characters that can be used to indicate the presence of a control code. For example, the symbol for escape contains the letters E, S, C in a descending sequence. They are of course quite distinct from the control codes they symbolize - the symbol for escape is not the same as escape! On the other hand, a control code might occasionally be displayed, by some programs, in a visible form, perhaps describing the control action rather than the code. For example, upon receiving octet 3 in the example situation above, a program might echo back (onto the terminal) *** or INTERRUPT or ^C. All such notations are program-specific conventions. Some control codes are sometimes named in a manner which seems to bind them to characters. In particular, control codes 1, 2, 3, ... are often called control-A, control-B, control-C, etc. (or CTRL-A or C-A or whatever). This is associated with the fact that on many keyboards, control codes can be produced (for sending to a computer) using a special key labeled "Control" or "Ctrl" or "CTR" or something like that together with letter keys A, B, C, ... This in turn is related to the fact that the code numbers of characters and control codes have been assigned so that the code of "Control-X" is obtained from the code of the upper case letter X by a simple operation (subtracting 64 decimal). But such things imply no real relationships between letters and control codes. The control code 3, or "Control-C", is not a variant of letter C at all, and its meaning is not associated with the meaning of C.
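
    The "subtracting 64" rule and the visible symbols for control codes can both be tried out in Python. The helper names are invented for this sketch. The second helper assumes that the symbols in question sit in the Unicode block starting at U+2400, with the symbol for control code n at position U+2400 + n for the codes 0 - 31.

```python
import unicodedata


def control_code(letter):
    """Code of "Control-letter": the upper case letter's code minus 64."""
    code = ord(letter.upper()) - 64
    if not 1 <= code <= 26:
        raise ValueError("expected a letter A - Z")
    return code


def control_picture(code):
    """Visible symbol for a control code 0 - 31 (assumes the block at U+2400)."""
    if not 0 <= code <= 31:
        raise ValueError("expected a control code in the range 0 - 31")
    return chr(0x2400 + code)


assert control_code("C") == 3  # "Control-C" is control code 3

# The symbol is an ordinary graphic character, distinct from the code itself.
esc_symbol = control_picture(27)
assert unicodedata.name(esc_symbol) == "SYMBOL FOR ESCAPE"
assert esc_symbol != chr(27)
```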

    Example: a letter and different glyphs for it
    latin capital letter Z (U+005A)
    Z Z Z Z Z

    It is important to distinguish the character concept from the glyph concept. A glyph is a presentation of a particular shape which a character may have when rendered or displayed. For example, the character Z might be presented as a boldface Z or as an italic Z, and it would still be a presentation of the same character. On the other hand, lower-case z is defined to be a separate character - which in turn may have different glyph presentations.

    This is ultimately a matter of definition: a definition of a character repertoire specifies the "identity" of characters, among other things. One could define a repertoire where uppercase Z and lowercase z are just two glyphs for the same character. On the other hand, one could define that italic Z is a character different from normal Z, not just a different glyph for it. In fact, in Unicode for example there are several characters which could be regarded as typographic variants of letters only, but for various reasons Unicode defines them as separate characters. For example, mathematicians use a variant of letter N to denote the set of natural numbers (0, 1, 2, ...), and this variant is defined as being a separate character ("double-struck capital N") in Unicode. There is more on this below.
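
    That these are separate characters by definition, each with its own code position and name, can be checked with Python's standard unicodedata module. The code position U+2115 for the double-struck N is taken from the Unicode character database rather than from the text above.

```python
import unicodedata

# Upper case Z and lower case z are defined as two separate characters.
assert (ord("Z"), ord("z")) == (0x5A, 0x7A)
assert unicodedata.name("Z") == "LATIN CAPITAL LETTER Z"
assert unicodedata.name("z") == "LATIN SMALL LETTER Z"

# The mathematicians' variant of N is yet another character, not a glyph of N.
double_struck_n = "\u2115"
assert unicodedata.name(double_struck_n) == "DOUBLE-STRUCK CAPITAL N"
assert double_struck_n != "N"
```

    A bold or italic Z, by contrast, has no code position of its own here: it is the same U+005A, rendered with a different glyph.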

    The design of glyphs has several aspects, both practical and esthetic. For an interesting review of a major company's description of its principles and practices, see Microsoft's Character design standards.

    Some discussions, such as ISO 9541-1, make a further distinction between "glyph image", which is an actual appearance of a glyph, and "glyph", which is a more abstract notion. In such an approach, "glyph" is close to the concept of "character", except that a glyph may present a combination of several characters. Thus, in that approach, the abstract characters "f" and "i" might be represented using an abstract glyph that combines the two characters into a ligature, which itself might have different physical manifestations. Such approaches need to be treated as different from the issue of treating ligatures as (compatibility) characters.

    The names of characters are assigned identifiers rather than definitions. Typically the names are selected so that they contain only letters A - Z, spaces, and hyphens; often the uppercase variant is the reference spelling of a character name. The same character may have different names in different definitions of character repertoires. Generally the name is intended to suggest a generic meaning and scope of use. But the Unicode standard warns (giving an example of a character with varying usage):

    A character may have a broader range of use than the most literal interpretation of its name might indicate; coded representation, name, and representative glyph need to be taken in context when establishing the semantics of a character.

    When a character repertoire is defined (e.g. in a standard), some particular glyph is often used to describe the appearance of each character, but this should be taken as an example only. The Unicode standard specifically says (in section 3.2) that great variation is allowed between the "representative glyph" appearing in the standard and a glyph used for the corresponding character:

    Consistency with the representative glyph does not require that the images be identical or even graphically similar; rather, it means that both images are generally recognized to be representations of the same character. Representing the character U+0061 Latin small letter a by the glyph "X" would violate its character identity.

    Thus, the definition of a repertoire is not a matter of just listing glyphs, but neither is it a matter of defining exactly the meanings of characters. It's actually an exception rather than a rule that a character repertoire definition explicitly says something about the meaning and use of a character.

    Some properties of characters (e.g. being classified as a letter or having numeric value in the sense that digits have) may be defined, as in the Unicode database, but such properties are rather general in nature.
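
    Python's standard unicodedata module exposes a few such properties, which shows how general they are. The describe helper is invented for this sketch; the category codes ("Lu" for an uppercase letter, "Nd" for a decimal digit, "No" for other numeric characters) come from the Unicode character database.

```python
import unicodedata


def describe(character):
    """Collect a few general properties of one character (illustrative helper)."""
    return {
        "name": unicodedata.name(character),
        "category": unicodedata.category(character),
        "numeric": unicodedata.numeric(character, None),  # None if no numeric value
    }


assert describe("A") == {
    "name": "LATIN CAPITAL LETTER A", "category": "Lu", "numeric": None}
assert describe("7") == {
    "name": "DIGIT SEVEN", "category": "Nd", "numeric": 7.0}
assert describe("\u00bd")["numeric"] == 0.5   # vulgar fraction one half
```

    Such properties say that a character is a letter or has a numeric value, but nothing about how it is actually used in a language or notation.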

    This vagueness may sound irritating, and it often is. But an essential point to be noted is that quite a lot of information is implied. You are expected to deduce what the character is, using both the character name and its representative glyph, and perhaps context too, like the grouping of characters under different headings like "currency symbols".

    For more information on the glyph concept, see ISO/IEC TR 15285:1998 and Apple's document Characters, Glyphs, and Related Terms.


