2007-08-27 15:03:07

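Character sets, character codes, encodings: I had never really sorted out how they differ and how they relate. After running into trouble with them repeatedly at work lately, I finally made up my mind to get them straight!
Original article: ~jkorpela/chars.html
If you have understood it all, you might try writing an automatic encoding detector for a browser! Mozilla's corresponding program is at: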

A repertoire of glyphs comprises a font. In a more technical sense, as the implementation of a font, a font is a numbered set of glyphs. The numbers correspond to code positions of the characters (presented by the glyphs). Thus, a font in that sense is character code dependent. An expression like "Unicode font" refers to such issues and does not imply that the font contains glyphs for all Unicode characters.

It is possible that a font which is used for the presentation of some character repertoire does not contain a different glyph for each character. For example, although characters such as Latin uppercase A, Cyrillic uppercase A, and Greek uppercase alpha are regarded as distinct characters (with distinct code values) in Unicode, a particular font might contain just one A which is used to present all of them. (For information about fonts, the original article links to a very large reference, but it's rather old: last update in 1996. The other linked resource is dated, too.)

Characters with quite different purposes and meanings may well look similar, or almost similar, in some fonts at least. Using a character as a surrogate for another for the sake of apparent similarity may lead to great confusion. Consider, for example, the so-called sharp s (es-zed), which is used in the German language. Some people who have noticed such a character in the ISO Latin 1 repertoire have thought "wow, here we have the beta character!". In many fonts, the sharp s (ß) really looks more or less like the Greek lowercase beta character (β). But it must not be used as a surrogate for beta. You wouldn't get very far with it, really; what's the big idea of having beta without alpha and all the other Greek letters? More seriously, the use of sharp s in place of beta would confuse text searches, spelling checkers, speech synthesizers, indexers, etc.; an automatic converter might well turn sharp s into ss; and some font might present sharp s in a manner which is very different from beta.
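To make the point about look-alike characters concrete, here is a minimal Python sketch using only the standard library. It is an illustration rather than part of the original article; the variable names are made up for this example.

```python
import unicodedata

# Three characters that many fonts draw with the same "A" glyph.
look_alikes = ["\u0041", "\u0410", "\u0391"]  # Latin A, Cyrillic A, Greek Alpha
for ch in look_alikes:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

sharp_s = "\u00df"  # ß, German sharp s
beta = "\u03b2"     # β, Greek small letter beta

# A program compares code values, not glyphs, so these never match.
print(sharp_s == beta)

# An automatic converter may well turn sharp s into "ss":
upper_form = sharp_s.upper()
folded_word = "Stra\u00dfe".casefold()
print(upper_form, folded_word)
```

A text search for beta would therefore never find a sharp s that was typed in its place, however similar the two look on screen.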

For some more explanations on this, see the discussion linked from the original article.

The identity of characters is defined by the definition of a character repertoire. Thus, it is not an absolute concept but relative to the repertoire; some repertoire might contain a character with mixed usage while another defines distinct characters for the different uses. For instance, the ASCII repertoire has a character called hyphen. It is also used as a minus sign (as well as a substitute for a dash, since ASCII contains no dashes). Thus, that ASCII character is a generic, multipurpose character, and one can say that in ASCII hyphen and minus are identical. But in Unicode, there are distinct characters named "hyphen" and "minus sign" (as well as different dash characters). For compatibility, the old ASCII character is preserved in Unicode, too (in the old code position, with the name hyphen-minus).
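As a small illustration (not from the original article), Python's standard unicodedata module can list the names Unicode gives to the old ASCII character and to its more specific relatives. The selection of characters below is my own.

```python
import unicodedata

# The multipurpose ASCII character and some of the distinct
# Unicode characters that took over its individual roles.
candidates = ["\u002d", "\u2010", "\u2212", "\u2013", "\u2014"]
names = {ch: unicodedata.name(ch) for ch in candidates}

for ch, name in names.items():
    print(f"U+{ord(ch):04X}", name)
```

The first entry is the ASCII character in its old code position; the others are separate characters with separate code positions, even where a font draws them almost identically.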

Unicode defines characters for micro sign, n-ary product, etc., as distinct from the Greek letters (small mu, capital pi, etc.) they originate from. This is a logical distinction and does not necessarily imply that different glyphs are used. The distinction is important e.g. when textual data in digital form is processed by a program (which "sees" the code values, through some encoding, and not the glyphs at all). Notice that Unicode does not make any distinction e.g. between the Greek small letter pi (π) and the mathematical symbol pi denoting the well-known constant 3.14159... (i.e. there is no separate symbol for the latter). For the ohm sign (Ω), there is a specific character (in the Symbols Area), but it is defined as being canonically equivalent to Greek capital letter omega (Ω), i.e. there are two separate characters but they are equivalent. On the other hand, Unicode makes a distinction between Greek capital letter pi (Π) and the mathematical symbol n-ary product (∏), so that they are not equivalents.

If you think this doesn't sound quite logical, you are not the only one to think so. But the point is that for symbols resembling Greek letters and used in various contexts, there are three possibilities in Unicode:

  • the symbol is regarded as identical to the Greek letter (just as its particular usage)
  • the symbol is included as a separate character but only for compatibility and as compatibility equivalent to the Greek letter
  • the symbol is regarded as a completely separate character.

You need to check the Unicode reference data for information about each individual symbol. Note in particular that a query to Indrek Hein's online character database will give such information in the decomposition info part (but only in the entries for compatibility characters!). As a rough rule of thumb about symbols looking like Greek letters, mathematical operators (like summation) exist as independent characters whereas symbols of quantities and units (like pi and ohm) are equivalent or identical to Greek letters.
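The different kinds of relationship can be probed with Unicode normalization. The following is a sketch using Python's unicodedata module, not anything from the original article; it assumes that NFC applies canonical equivalences only, while NFKC also applies compatibility equivalences.

```python
import unicodedata

ohm_sign = "\u2126"       # canonically equivalent to omega
omega = "\u03a9"
micro_sign = "\u00b5"     # compatibility equivalent to mu
mu = "\u03bc"
n_ary_product = "\u220f"  # a completely separate character
capital_pi = "\u03a0"

# Canonical equivalence: even NFC replaces the ohm sign with omega.
ohm_nfc = unicodedata.normalize("NFC", ohm_sign)

# Compatibility equivalence: NFC keeps the micro sign, NFKC maps it to mu.
micro_nfc = unicodedata.normalize("NFC", micro_sign)
micro_nfkc = unicodedata.normalize("NFKC", micro_sign)

# No equivalence at all: the n-ary product survives every normalization.
product_nfkc = unicodedata.normalize("NFKC", n_ary_product)

print(ohm_nfc == omega, micro_nfc == micro_sign,
      micro_nfkc == mu, product_nfkc == capital_pi)
```

The symbol pi for the constant 3.14159... needs no test at all, since it simply is the Greek small letter pi.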

In addition to the fact that the appearance of a character may vary, it is quite possible that some program fails to display a character at all. Perhaps the program cannot interpret a particular way in which the character is presented. The reason might simply be that some program-specific method had been used to denote the character and a different program is in use now. (This happens quite often even if "the same" program is used; for example, Internet Explorer version 4.0 is able to recognize &alpha; as denoting the Greek letter alpha (α) but IE 3.0 is not and displays the notation literally.) And naturally it often occurs that a program does not recognize the basic character encoding of the data, either because it was not properly informed about the encoding according to which the data should be interpreted or because it has not been programmed to handle the particular encoding in use.

But even if a program recognizes some data as denoting a character, it may well be unable to display it since it lacks a glyph for it. Often it will help if the user manually checks the font settings, perhaps manually trying to find a rich enough font. (Advanced programs could be expected to do this automatically and even to pick up glyphs from different fonts, but such expectations are mostly unrealistic at present.) But it's quite possible that no such font can be found. As an important detail, the possibility of seeing e.g. Greek characters on some Windows systems depends on whether "internationalization support" has been installed.

A well-designed program will in some appropriate way indicate its inability to display a character. For example, a small rectangular box, the size of a character, could be used to indicate that there is a character which was recognized but cannot be displayed. Some programs use a question mark, but this is risky - how is the reader expected to distinguish such usage from the real "?" character?

Although several character repertoires, most notably that of ISO 10646 (and Unicode), contain mathematical and other symbols, the presentation of mathematical formulas is essentially not a character level problem. At the character level, symbols like integration or n-ary summation can be defined and their code positions and encodings defined, and representative glyphs shown, and perhaps some usage notes given. But the construction of real formulas, e.g. for a definite integral of a function, is a different thing, no matter whether one considers formulas abstractly (how the structure of the formula is given) or presentationally (how the formula is displayed on paper or on screen). To mention just a few approaches to such issues, the TeX system is widely used by mathematicians to produce high-quality presentations of formulas, and MathML is an ambitious project for creating a markup language for mathematics so that both structure and presentation can be handled.

Other structural or presentational aspects, such as font variation, are to be handled separately. However, there are characters which would now be considered as differing in font only but which for historical reasons are regarded as distinct.

There is a large number of compatibility characters in ISO 10646 which are variants of other characters. They were included for compatibility with other standards so that data presented using some other code can be converted to ISO 10646 and back without losing information. The Unicode standard says (in section 2.4):

Compatibility characters are those that would not have been encoded except for compatibility and round-trip convertibility with other standards. They are variants of characters that already have encodings as normal (that is, non-compatibility) characters in the Unicode Standard.

There is a large number of compatibility characters in the Compatibility Area but also scattered around the Unicode space.

Many, but not all, compatibility characters have compatibility decompositions. The Unicode database contains, for each character, a field (the sixth one) which specifies its eventual compatibility decomposition.

Thus, to take a simple example, superscript two (²) is an ISO Latin 1 character with its own code position in that standard. In the ISO 10646 way of thinking, it would have been treated as just a superscript variant of digit two. But since the character is contained in an important standard, it was included into ISO 10646, though only as a "compatibility character". The practical reason is that now one can convert from ISO Latin 1 to ISO 10646 and back and get the original data. This does not mean that in the ISO 10646 philosophy superscripting (or subscripting, italics, bolding etc.) would be irrelevant; rather, they are to be handled at another level of data presentation, such as some special markup.

There is a document titled Unicode in XML and other Markup Languages, produced jointly by the World Wide Web Consortium (W3C) and the Unicode Consortium. It discusses, among other things, compatibility characters: should they be used, or should the corresponding non-compatibility characters be used, perhaps with some markup and/or style sheet that corresponds to the difference between them. The answers depend on the nature of the characters and the available markup and styling techniques. For example, for superscripts, the use of sup markup (as in HTML) is recommended, i.e. <sup>2</sup> is preferred over &sup2;. This is a debatable issue.

The definition of Unicode indicates our sample character, superscript two, as a compatibility character with the compatibility decomposition "<super> + 0032 2". Here "<super>" is a semi-formal way of referring to what is considered as typographic variation, in this case superscript style, and "0032 2" shows the hexadecimal code of a character and the character itself.

Some compatibility characters have compatibility decompositions consisting of several characters. Due to this property, they can be said to represent ligatures in the broad sense. For example, latin small ligature fi (U+FB01) has the obvious decomposition consisting of letters "f" and "i". It is still a distinct character in Unicode, but in the spirit of Unicode in XML and other Markup Languages, we should not use it except for storing and transmitting existing data which contains that character. Generally, ligature issues should be handled outside the character level, e.g. selected automatically by a formatting program or indicated using some suitable markup.

Note that the word ligature can be misleading when it appears in a character name. In particular, the old name of the character "æ", latin small letter ae (U+00E6), is latin small ligature ae, but it is not a ligature of "a" and "e" in the sense described above. It has no compatibility decomposition.
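The decomposition field of the Unicode database can be read directly from Python's unicodedata module. This is a minimal sketch added for illustration; the choice of the three sample characters follows the discussion above.

```python
import unicodedata

superscript_two = "\u00b2"  # ²
ligature_fi = "\ufb01"      # latin small ligature fi
letter_ae = "\u00e6"        # æ, not a ligature in this sense

# decomposition() returns the raw field; an empty string means "none".
fields = {ch: unicodedata.decomposition(ch)
          for ch in (superscript_two, ligature_fi, letter_ae)}

for ch, field in fields.items():
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), repr(field))

# Applying the compatibility decomposition turns the ligature into two letters.
fi_expanded = unicodedata.normalize("NFKC", ligature_fi)
print(fi_expanded, len(fi_expanded))
```

Note that the formatting tag (such as the superscript indication) is part of the field, so the decomposition alone does not preserve the typographic variation; that information would have to be carried by markup.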

In section 1.15 Ligatures, the term ligature is defined as follows:

A ligature occurs where two or more letterforms are written or printed as a unit. Generally, ligatures replace characters that occur next to each other when they share common components. Ligatures are a subset of a more general class of figures called "contextual forms."

A diacritic mark, i.e. an additional graphic such as an accent or cedilla attached to a character, can be treated in different ways when defining a character repertoire. The author's notes on this, linked from the original article, also explain why the so-called spacing diacritic marks are of very limited usefulness, except when taken into some secondary usage.

In the Unicode approach, there are separate characters called combining diacritical marks. The general idea is that you can express a vast set of characters with diacritics by representing them so that a base character is followed by one or more (!) combining (non-spacing) diacritic marks. And a program which displays such a construct is expected to do rather clever things in formatting, e.g. selecting a particular shape for the diacritic according to the shape of the base character. This requires Unicode support at implementation level 3. Most programs currently in use are totally incapable of doing anything meaningful with combining diacritic marks. But there is some simple support for them in Internet Explorer for example, though you would need a font which contains the combining diacritics (such as Arial Unicode MS); then IE can handle simple combinations reasonably. A test page is linked from the original article. Regarding advanced implementation of the rendering of characters with diacritic marks, consult Unicode Technical Note #2.

Using combining diacritic marks, we have a wide range of possibilities. We can put, say, a diaeresis on a gamma, although "Greek small letter gamma with diaeresis" does not exist as a character. The combination U+03B3 U+0308 consists of two characters, although its visual presentation looks like a single character in the same sense as "ä" looks like a single character. This is how your browser displays the combination: "γ̈". In most browsing situations at present, it probably isn't displayed correctly; you might see e.g. the letter gamma followed by a box that indicates a missing glyph, or you might see gamma followed by a diaeresis shown separately (¨).

Thus, in practical terms, in order to use a character with a diacritic mark, you should primarily try to find it as a precomposed character. A precomposed character, also called composite character or decomposable character, is one that has a code position (and thereby identity) of its own but is in some sense equivalent to a sequence of other characters. There are lots of them in Unicode, and they cover the needs of most (but not all) languages of the world, but not e.g. the presentation of the International Phonetic Alphabet (IPA) which, in its general form, requires several different diacritic marks. For example, the character latin small letter a with diaeresis (U+00E4, ä) is, by Unicode definition, decomposable to the sequence of the two characters latin small letter a (U+0061) and combining diaeresis (U+0308). This is at present mostly a theoretical possibility. Generally, by decomposing all decomposable characters one could in many cases simplify the processing of textual data (and the resulting data might be converted back to a format using precomposed characters). See e.g. the working draft linked from the original article.
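Decomposition and recomposition of this kind can be tried out with Unicode normalization forms. The sketch below uses Python's unicodedata module and is an illustration only; it assumes NFD as the "decompose everything" step and NFC as the conversion back to precomposed characters.

```python
import unicodedata

precomposed = "\u00e4"  # ä as one character

# Decompose: latin small letter a followed by combining diaeresis.
decomposed = unicodedata.normalize("NFD", precomposed)
print([f"U+{ord(c):04X}" for c in decomposed])

# Recompose: back to the single precomposed character.
recomposed = unicodedata.normalize("NFC", decomposed)

# Gamma with diaeresis has no precomposed form, so it stays two characters.
gamma_diaeresis = "\u03b3\u0308"
gamma_nfc = unicodedata.normalize("NFC", gamma_diaeresis)

# A nonzero combining class marks U+0308 as a combining character.
diaeresis_class = unicodedata.combining("\u0308")
print(len(decomposed), len(recomposed), len(gamma_nfc), diaeresis_class)
```

A program that compares strings without normalizing first would treat the one-character and two-character forms of "ä" as different data, even though they are meant to be the same text.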

Typing characters on a computer may appear deceptively simple: you press a key labeled "A", and the character "A" appears on the screen. Well, you actually get uppercase "A" or lowercase "a" depending on whether you used the shift key or not, but that's common knowledge. You also expect "A" to be included into a disk file when you save what you are typing, you expect "A" to appear on paper if you print your text, and you expect "A" to be sent if you send your product by E-mail or something like that. And you expect the recipient to see an "A".

Thus far, you should have learned that the presentation of a character in computer storage, on disk, or in data transfer may vary a lot. You have probably realized that especially if it's not the common "A" but something more special (say, an "A" with an accent), strange things might happen, especially if data is not accompanied with adequate information about its encoding.

But you might still be too confident. You probably expect that on your system at least things are simpler than that. If you use your very own very personal computer and press the key labeled "A" on its keyboard, then shouldn't it be evident that in its storage and processor, on its disk, on its screen it's invariably "A"? Can't you just ignore its internal character code and character encoding? Well, probably yes - with "A". I wouldn't be so sure about "Ä", for instance. (On Windows systems, for example, DOS mode programs differ from genuine Windows programs in this respect; they use a DOS character code.)
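The difference between "A" and "Ä" is easy to see by encoding both under several codes. This is a sketch, not part of the original article; the Python codec names cp437 and cp850 stand in here for "a DOS character code" and cp1252 for the Windows one, which is an assumption about typical Western European systems.

```python
encodings = ["iso-8859-1", "cp1252", "cp437", "cp850", "utf-8"]

plain_a = "A"
a_diaeresis = "\u00c4"  # Ä

# Map each encoding name to the octets it produces for the character.
octets_for_a = {enc: plain_a.encode(enc) for enc in encodings}
octets_for_a_diaeresis = {enc: a_diaeresis.encode(enc) for enc in encodings}

for enc in encodings:
    print(enc, octets_for_a[enc].hex(), octets_for_a_diaeresis[enc].hex())
```

"A" comes out as the single octet 41 (hexadecimal) in every one of these, which is why it rarely causes trouble; "Ä" comes out in three different ways.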

When you press a key on your keyboard, then what actually happens is this. The keyboard sends the code of a character to the processor. The processor then, in addition to storing the data internally somewhere, normally sends it to the display device. (For more details on this, as regards one common situation, see the description linked from the original article.) Now, the keyboard settings and the display settings might be different from what you expect. Even if a key is labeled "Ä", it might send something other than the code of "Ä" in the character code used in your computer. Similarly, the display device, upon receiving such a code, might be set to display something different. Such mismatches are usually undesirable, but they are definitely possible.

If your computer uses internally, say, the ISO Latin 1 character repertoire, you probably won't find keys for all 191 characters in it on your keyboard. And for Unicode, it would be quite impossible to have a key for each character! Different keyboards are used, often according to the needs of particular languages. For example, keyboards used in Sweden often have a key for the å character but seldom a key for ñ; in Spain the opposite is true. Quite often some keys have multiple uses via various composition keys, as explained below. For an illustration of the variation, as well as to see what layout might be used in some environments, see the keyboard layout collections linked from the original article, which include:

  • a collection that contains several layouts for "exotic" languages too
  • an interactive Windows Layouts page; requires Internet Explorer with JavaScript enabled. (Actually, using it I found out new features in the Finnish keyboard I have: I can use Alt Gr m to produce the micro sign µ, although there is no hint about this in the "m" key itself.)

In several systems, including MS Windows, it is possible to switch between different keyboard settings. This means that the effects of different keys do not necessarily correspond to the engravings in the key caps but to some other assignments. To ease typing in such situations, "virtual keyboards" can be used. This means that an image of a keyboard is visible on the screen, letting the user type characters by clicking on keys in it or using the information to see the current assignments of the keys of the physical keyboard. For the Office software on Windows systems, there is a free add-in available for this: Microsoft Visual Keyboard.

Thus, you often need program-specific ways of entering characters from a keyboard, either because there is no key for a character you need or there is but it does not work (properly). The program involved might be part of system software, or it might be an application program. Some important examples of such ways:

  • On Windows systems, you can (usually - some application programs may override this) produce any character in the Windows character set (naturally, in its Windows encoding) as follows: Press down the (left) Alt key and keep it down. Then type, using the separate numeric keypad (not the numbers above the letter keys!), the four-digit code of the character in decimal. Finally release the Alt key. Notice that the first digit is always 0, since the code values are in the range 32 - 255 (decimal). For instance, to produce the letter "Ä" (which has code 196 in decimal), you would press Alt down, type 0196 and then release Alt. Upon releasing Alt, the character should appear on the screen. In MS Word, the method works only if Num Lock is set. This method is often referred to as Alt-0nnn. (If you omit the leading zero, i.e. use Alt-nnn, the effect is different, since that way you insert the character in code position nnn in the DOS character code! For example, Alt-196 would probably insert a graphic character which looks somewhat like a hyphen. There are variations in the behavior of various Windows programs in this area, and using those DOS codes is best avoided.)
  • In the Emacs editor (which is popular especially on Unix systems), you can produce any character by typing first control-Q, then its code as a three-digit octal number. To produce "Ä", you would thus type control-Q followed by the three digits 304 (and expect the "Ä" character to appear on screen). This method is often referred to as C-Q-nnn. (There are other methods, too.)
  • Text processing programs often modify user input e.g. so that when you have typed the three characters "(", "c", and ")", the program changes, both internally and visibly, that string to the single character "©". This is often convenient, especially if you can add your own rules for modifications, but it causes unpleasant surprises and problems when you actually meant what you wrote, e.g. wanted to write letter "c" in parentheses.
  • Programs often process some keyboard key combinations, typically involving the use of an Alt or Alt Gr key or some other "composition key", by converting them to special characters. In fact, even the well-known shift key is a composition key: it is used to modify the meaning of another key, e.g. by changing a letter to uppercase or turning a digit key to a special character key. Such things are not just "program-specific"; they also depend on the program version and settings (and on the keyboard, of course), and could well be user-modifiable. For example, in order to support the euro sign, various methods have been developed, e.g. by Microsoft so that pressing the "e" key while keeping the Alt Gr key pressed down might produce the euro sign - in some encoding! But this may require a special "euro update", and the key combinations vary even when we consider Microsoft products only. So it would be quite inappropriate to say e.g. "to type the euro, use AltGr+e" as general, unqualified advice.
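The difference between Alt-0nnn and Alt-nnn described above, and the octal digits used with C-Q-nnn, can be reproduced in a few lines of Python. This is only a sketch: cp1252 and cp437 are assumed here as stand-ins for "the Windows character set" and "the DOS character code", and the actual code pages depend on the system's settings.

```python
import unicodedata

code = 196  # the decimal code typed on the numeric keypad

# Alt-0196: the number is a code position in the Windows character set.
windows_char = bytes([code]).decode("cp1252")

# Alt-196: the same number taken as a code position in a DOS code page.
dos_char = bytes([code]).decode("cp437")

# C-Q-nnn in Emacs: the same code written as three octal digits.
emacs_digits = format(code, "o")

print(windows_char, unicodedata.name(windows_char))
print(dos_char, unicodedata.name(dos_char))
print(emacs_digits)
```

Under these assumptions the DOS interpretation yields a box drawing character, a horizontal line, which matches the "looks somewhat like a hyphen" remark.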

The Alt and Alt Gr keys mentioned above are not present on all keyboards, and often they both carry the text "Alt" but they can be functionally different! Typically, those keys are on the left and on the right of the space bar. It depends on the physical keyboard what the key cap texts are, and it depends on the keyboard settings whether the keys have the same effect or different effects. The name "Alt Gr" for "right Alt" is short for "alternate graphic", and it's mostly used to create additional characters, whereas (left) "Alt" is typically used for keyboard access to menus.

The last method above could often be called "device dependent" rather than program specific, since the program that performs the conversion might be a keyboard driver. In that case, normal programs would have all their input from the keyboard processed that way. This method may also involve the use of auxiliary keys for typing characters with diacritic marks such as "á". Such an auxiliary key is often called dead key, since just pressing it causes nothing; it works only in combination with some other key. A more official name for a dead key is modifier key. For example, depending on the keyboard and the driver, you might be able to produce "á" by pressing first a key labeled with the acute accent (´), then the "a" key.

My keyboard has two keys for such purposes. There's the accent key, with the acute accent and the grave accent (`) as "upper case" character, meaning I need to use the shift key for the grave. And there's a key with the dieresis (¨) and the circumflex (^) above it (i.e. as "upper case") and the tilde (~) below or to the left of it (meaning I need to use Alt Gr for it), so I can produce characters with those diacritics. Note that this does not involve any operation on the characters ´`¨^~, and the keyboard does not send those characters at all in such situations. If I try to enter that way a character outside the ISO Latin 1 repertoire, I get just the diacritic as a separate character followed by the normal character, e.g. "^j". To enter the diacritic itself, such as the tilde, I may need to press the space bar so that the tilde diacritic combines with the blank (producing ~) instead of a letter (producing e.g. "ã"). Your situation may well be different, in part or entirely. For example, a typical French keyboard has separate keys for those accented letters that are used in French (e.g. "à"), but the accents themselves can be difficult to produce. You might need to type AltGr è followed by a space to produce the grave accent `.

It is often possible to use various "escape" notations for characters. This rather vague term means notations which are afterwards converted to (or just displayed as) characters according to some specific rules by some programs. They depend on the markup, programming, or other language (in a broad but technical meaning for "language", so that data formats can be included but human languages are excluded). If different languages have similar conventions in this respect, a language designer may have picked up a notation from an existing language, or it might be a coincidence.

The phrase "escape notations" or even "escapes" for short is rather widespread, and it reflects the general idea of escaping from the limitations of a character repertoire or device or protocol or something else. So it's used here, although a name like meta notations might be better. It is in any case essential to distinguish these notations from the use of the ESC (escape) control code in ASCII and other character codes.

Examples:

  • In the PostScript language, characters have names, such as Adieresis for Ä, which can be used to denote them according to certain rules.
  • In the RTF data format, the notation \'c4 is used to denote Ä.
  • In TeX systems, there are different ways of producing characters, possibly depending on the "packages" used. Examples of ways to produce Ä: \"A, \symbol{196}, \char'0304, \capitaldieresis{A} (a large list is linked from the original article).
  • In the HTML language one can use the notation &Auml; for Ä. In the official HTML terminology, such notations are called entity references. It depends on HTML version which entities are defined, and it depends on a browser which entities are actually supported.
  • In HTML, one can also use the notation &#196; for Ä. Generally, in any SGML based system, or "SGML application" as the jargon goes, a numeric character reference (or, actually, just character reference) of the form &#number; can be used, and it refers to the character which is in code position number in the character code defined for the "SGML application" in question. This is actually very simple: you specify a character by its index (position, number). But in SGML terminology, the character code which determines the interpretation of &#number; is called, quite confusingly, the document character set. For HTML, the "document character set" is ISO 10646 (or, to be exact, a subset thereof, depending on HTML version). A most essential point is that for HTML, the "document character set" is completely independent of the encoding of the document! The so-called character entity references like &Auml; in HTML can be regarded as symbolic names defined for some numeric character references. In XML, character references use ISO 10646 by language definition. Although both entity and character references are markup, to be used in markup languages, they are often replaced by the corresponding characters when a user types text on an Internet discussion forum. This might be a conscious decision by the forum designer, but quite often it is caused unintentionally.
  • In CSS, you can present a character as \n, where n is the Unicode code position in hexadecimal.
  • In the C programming language, one can usually write \304 to denote Ä within a string constant (an octal escape takes at most three digits), although this makes the program character code dependent.

As you can see, the notations typically involve some (semi-)mnemonic name or the code number of the character, in some number system. (The code number for our example character is 196 in decimal, 304 in octal, C4 in hexadecimal.) And there is some method of indicating that the letters or digits are not to be taken as such but as part of a special notation denoting a character. Often some specific character such as the backslash is used as an "escape character". This implies that such a character cannot be used as such in the language or format but must itself be "escaped"; for example, to include the backslash itself into a string constant in C, you need to write it twice (\\).
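Python happens to support several notations of the kinds listed above, which makes it convenient for a side-by-side comparison. The sketch below is an illustration, not from the original article; html.unescape from the standard library is used to resolve the HTML notations.

```python
import html

target = "\u00c4"  # Ä
code = ord(target)

# The same code number in three number systems.
print(code, oct(code), hex(code))

# Different escape notations that all denote the same character.
notations = {
    "octal escape": "\304",
    "hexadecimal escape": "\xc4",
    "Unicode name": "\N{LATIN CAPITAL LETTER A WITH DIAERESIS}",
    "HTML entity reference": html.unescape("&Auml;"),
    "HTML character reference": html.unescape("&#196;"),
    "HTML hexadecimal reference": html.unescape("&#xC4;"),
}
for label, value in notations.items():
    print(label, value == target)

# The escape character itself must be escaped: two typed, one stored.
backslash = "\\"
print(len(backslash))
```

In each case the source text contains only plain ASCII characters; the "Ä" exists only after the program has interpreted the notation.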

In cases like these, the character itself does not occur in a file (such as an HTML document or a C source program). Instead, the file contains the "escape" notation as a character sequence, which will then be interpreted in a specific way by programs like a Web browser or a C compiler. One can in a sense regard the "escape notations" as encodings used in specific contexts upon specific agreements.

For example, when sending E-mail one might use A" (letter A followed by a quotation mark) as a surrogate for Ä (letter A with dieresis), or one might use AE instead of Ä. The reader is assumed to understand that e.g. A" on display actually means Ä. Quite often the purpose is to use ASCII characters only, so that the typing, transmission, and display of the characters is "safe". But this typically means that text becomes very messy; the Finnish word Hämäläinen does not look too good or readable when written as Ha"ma"la"inen or Haemaelaeinen. Such usage is based on special (though often implicit) conventions and can cause a lot of confusion when there is no mutual agreement on the conventions, especially because there are so many of them. (For example, to denote letter a with acute accent, á, a convention might use the apostrophe, a', or the solidus, a/, or the acute accent, a´, or something else.)

There is a document titled Character Mnemonics & Character Sets, published as RFC 1345, which lists a large number of "escape notations" for characters. They are very short, typically two characters, e.g. A: for Ä and th for þ (thorn). Naturally there's the problem that the reader must know whether e.g. th is to be understood that way or as two letters t and h. So the system is primarily for referring to characters (see below), but under suitable circumstances it could also be used for actually writing texts, when the ambiguities can somehow be removed by additional conventions or by context. RFC 1345 cannot be regarded as official or widely known, but if you need, for some applications, an "escape scheme", you might consider using those notations instead of reinventing the wheel.

There are also various ways to identify a character when it cannot be used as such or when the appearance of a character is not sufficient identification. This might be regarded as a variant of the escape notations discussed above, but the pragmatic view is different here. We are not primarily interested in using characters in running text but in specifying which character is being discussed.

For example, when discussing the Cyrillic letter that resembles the Latin letter E (and may have an identical or very similar glyph, and is commonly transliterated as E), there are various options:

  • "Cyrillic E"; this is probably intuitively understandable in this case, and can be seen as referring either to the similarity of shape or to the transliteration equivalence; but in the general case these interpretations do not coincide, and the method is otherwise vague too
  • "U+0415"; this is a unique identification but requires the reader to know the idea of U+nnnn notations
  • "cyrillic capital letter ie" (using the official Unicode name) or "cyrillic IE" (using an abridged version); one problem with this is that the names can be long even if simplified, and they still cannot be assumed to be universally known even by people who recognize the character
  • "KE02", which uses a special notation system (its defining standard is named in the original article); the system uses a compact notation and is marginally mnemonic (K = kirillica 'Cyrillics'; the numeric codes indicate small/capital letter variation and the use of diacritics)
  • any of the escape notations discussed above, such as "E=" by RFC 1345 or "&#1045;" in HTML; this can be quite adequate in a context where the reader can be assumed to be familiar with the particular notation.
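Several of the identifications listed above can be derived mechanically from the character itself. Here is a minimal Python sketch, added for illustration; the variable names are made up for this example.

```python
import unicodedata

ch = "\u0415"  # the Cyrillic letter under discussion

u_notation = f"U+{ord(ch):04X}"      # code position in U+nnnn form
unicode_name = unicodedata.name(ch)  # official Unicode name
html_reference = f"&#{ord(ch)};"     # numeric character reference

print(u_notation, unicode_name, html_reference)

# Despite the similar glyph, it is not the Latin letter E.
is_latin_e = (ch == "E")
print(is_latin_e)
```

The decimal number in the HTML notation and the hexadecimal number in the U+nnnn notation denote the same code position, 1045 being 415 in hexadecimal.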

It is hopefully obvious from the preceding discussion that a sequence of octets can be interpreted in a multitude of ways when processed as character data. By looking at the octet sequence only, you cannot even know whether each octet presents one character or just part of a two-octet presentation of a character, or something more complicated. Sometimes one can guess the encoding, but data processing and transfer shouldn't be guesswork.
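To see how much the interpretation can vary, one can decode the same two octets under a few encodings. This is a sketch using Python's built-in codecs; the particular octets and encodings are chosen for illustration only.

```python
octets = bytes([0xC3, 0x84])

# Two octets forming one character.
as_utf8 = octets.decode("utf-8")

# Each octet taken as one character; the second is a control code here.
as_latin1 = octets.decode("iso-8859-1")

# Each octet taken as one character, but with a different second character.
as_cp1252 = octets.decode("cp1252")

# Both octets together as one 16-bit code value.
as_utf16 = octets.decode("utf-16-le")

for label, text in [("utf-8", as_utf8), ("iso-8859-1", as_latin1),
                    ("cp1252", as_cp1252), ("utf-16-le", as_utf16)]:
    print(label, len(text), [f"U+{ord(c):04X}" for c in text])
```

None of these decodings fails, so nothing in the data itself reveals which one was intended.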

Naturally, a sequence of octets could be intended to present other than character data, too. It could be an image in a bitmap format, or a computer program in binary form, or numeric data in the internal format used in computers.

This problem can be handled in different ways in different systems when data is stored and processed within one computer system. For data transmission, a platform-independent method of specifying the general format and the encoding and other relevant information is needed. Such methods exist, although they are not always used widely enough. People still send each other data without specifying the encoding, and this may cause a lot of harm. Attaching a human-readable note, such as a few words of explanation in an E-mail message body, is better than nothing. But since data is processed by programs which cannot understand such notes, the encoding should be specified in a standardized computer-readable form.

Internet media types, often called MIME media types, can be used to specify a major media type ("top level media type", such as text), a subtype (such as html), and an encoding (such as iso-8859-1). They were originally developed to allow sending other than plain ASCII data by E-mail. They can be (and should be) used for specifying the encoding when data is sent over a network, e.g. by E-mail or using the HTTP protocol on the World Wide Web.

The media type concept is defined in . The procedure for registering types is given in ; according to it, the registry is kept by at ftp://ftp.isi.edu/in-notes/iana/assignments/media-types/ but it has in fact been moved to

The technical term used to denote a character encoding in the Internet media type context is "character set", abbreviated "charset". This has caused a lot of confusion, since "set" can easily be understood as repertoire!

Specifically, when data is sent in MIME format, the media type and encoding are specified in a manner illustrated by the following example:
Content-Type: text/html; charset=iso-8859-1
This specifies, in addition to saying that the media type is text and subtype is html, that the character encoding is ISO 8859-1.
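Receiving software can pick these pieces apart without any guesswork. The sketch below uses Python's standard `email` package to read the header from the example; the message body and the variable names are invented for the illustration.

```python
from email import message_from_string

# A tiny message consisting of one header, a blank line, and a body.
raw = "Content-Type: text/html; charset=iso-8859-1\n\n<p>body</p>\n"
msg = message_from_string(raw)

main_type = msg.get_content_maintype()  # the top level media type
sub_type = msg.get_content_subtype()    # the subtype
charset = msg.get_content_charset()     # the "charset" parameter, lowercased

print(main_type, sub_type, charset)
```

If the header carried no charset parameter, `get_content_charset()` would return `None`, and the program would be back to guessing.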

with references to documents defining their meanings, is kept by at

(According to the documentation of the registration procedure, , it should be elsewhere, but it has been moved.) I have composed a , ordered alphabetically by "charset" name and accompanied with some hypertext references.

Several character encodings have alternate (alias) names in the registry. For example, the basic (ISO 646) variant of ASCII can be called "ASCII" or "ANSI_X3.4-1968" or "cp367" (plus a few other names); the preferred name in MIME context is, according to the registry, "US-ASCII". Similarly, ISO 8859-1 has several names, the preferred MIME name being "ISO-8859-1". The "native" encoding for Unicode, UCS-2, is named "ISO-10646-UCS-2" there.
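Software libraries usually keep a similar alias table of their own. As a hedged illustration, the sketch below asks Python's codec registry to resolve some of the names mentioned above. Note that this is Python's alias table, not the official registry, so the two need not agree on every name.

```python
import codecs

# Names taken from the text; each should resolve to the same codec.
ascii_names = ["ASCII", "ANSI_X3.4-1968", "cp367", "US-ASCII"]
canonical = {name: codecs.lookup(name).name for name in ascii_names}

# Two spellings of the ISO 8859-1 encoding.
latin1_a = codecs.lookup("ISO-8859-1").name
latin1_b = codecs.lookup("latin1").name

print(canonical)
print(latin1_a, latin1_b)
```

An unknown name makes `codecs.lookup` raise `LookupError`, which is a reasonable point for a program to report that it cannot handle the declared encoding.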

The Content-Type information is an example of information in a header. Headers relate to some data, describing its presentation and other things, but are passed as logically separate from it. Possible headers and their contents are defined in the basic MIME specification, RFC 2045. Adequate headers should normally be generated automatically by the software which sends the data (such as a program for sending E-mail, or a Web server) and interpreted automatically by receiving software (such as a program for reading E-mail, or a Web browser). In E-mail messages, headers precede the message body; it depends on the E-mail program whether and how it displays the headers. For Web documents, a Web server is required to send headers when it delivers a document to a browser (or other user agent) which has sent a request for the document.

In addition to media types and character encodings, MIME addresses several other aspects too. has composed the documentation , which contains the basic RFCs on MIME in hypertext format and for them.

RFC 2045 defines, among many other things, the general purpose "Quoted-Printable" (QP) encoding which can be used to present any sequence of octets as a sequence of such octets which correspond to ASCII characters. This implies that the sequence of octets becomes longer, and if it is read as an ASCII string, it can be incomprehensible to humans. But what is gained is robustness in data transfer, since the encoding uses only "safe" ASCII characters which will most probably get through any component in the transfer unmodified.

Basically, QP encoding means that most octets smaller than 128 are used as such, whereas larger octets and some of the small ones are presented as follows: octet n is presented as a sequence of three octets, corresponding to ASCII codes for the = sign and the two digits of the hexadecimal notation of n. If QP encoding is applied to a sequence of octets presenting character data according to the ISO 8859-1 character code, then effectively this means that most ASCII characters (including all ASCII letters) are preserved as such whereas e.g. the ISO 8859-1 character ä (code position 228 in decimal, E4 in hexadecimal) is encoded as =E4. (For obvious reasons, the equals sign = itself is among the few ASCII characters which are encoded. Being in code position 61 in decimal, 3D in hexadecimal, it is encoded as =3D.)
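The rule can be tried out with Python's standard `quopri` module, which implements Quoted-Printable. This is only a sketch; the sample text is invented, and it is first turned into ISO 8859-1 octets because QP operates on octets, not on characters.

```python
import quopri

# Character data presented as ISO 8859-1 octets; ä becomes octet 228 (E4).
original = "später = late".encode("iso-8859-1")

encoded = quopri.encodestring(original)  # only ASCII octets remain
decoded = quopri.decodestring(encoded)   # back to the original octets

print(encoded)
print(decoded == original)
```

The encoded form contains =E4 for ä and =3D for the equals sign, while the ASCII letters pass through unchanged.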

Notice that encoding ISO 8859-1 data this way means that the character code is the one specified by the ISO 8859-1 standard, whereas the character encoding is different from the one specified (or at least suggested) in that standard. Since QP only specifies the mapping of a sequence of octets to another sequence of octets, it is a pure encoding and can be applied to any character data, or to any data for that matter.

Naturally, Quoted-Printable encoding needs to be processed by a program which knows it and can convert it to human-readable form. It looks rather confusing when displayed as such. Roughly speaking, one can expect most E-mail programs to be able to handle QP, but the same does not apply to newsreaders (or Web browsers). Therefore, you should normally use QP in E-mail only.

Basically, MIME should let people communicate smoothly without hindrances caused by character code and encoding differences. MIME should handle the necessary conversions automatically and invisibly.

For example, when person A sends E-mail to person B, the following should happen: The E-mail program used by A encodes A's message in some particular manner, probably according to some convention which is normal on the system where the program is used (such as ISO 8859-1 encoding on a typical modern Unix system). The program automatically includes information about this encoding into an E-mail header, which is usually invisible both when sending and when reading the message. The message, with the headers, is then delivered, through network connections, to B's system. When B uses his E-mail program (which may be very different from A's) to read the message, the program should automatically pick up the information about the encoding as specified in a header and interpret the message body according to it. For example, if B is using a Macintosh computer, the program would automatically convert the message into Mac's internal encoding and only then display it. Thus, if the message was ISO 8859-1 encoded and contained the Ä (upper case A with dieresis) character, encoded as octet 196, the E-mail program used on the Mac should use a conversion table to map this to octet 128, which is the encoding for Ä on Mac. (If the program fails to do such a conversion, strange things will happen. ASCII characters would be displayed correctly, since they have the same codes in both encodings, but instead of Ä, the character corresponding to octet 196 in Mac encoding would appear - a symbol which looks like f in italics.)
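The conversion described here, and what happens when it is skipped, can be reproduced in a few lines. The sketch below uses Python's `iso-8859-1` and `mac_roman` codecs as stand-ins for the conversion table a mail program would use; the variable names are made up for the example.

```python
# What A's program sends: Ä as a single ISO 8859-1 octet.
sent = "Ä".encode("iso-8859-1")

# Correct handling on the Mac: decode as declared, re-encode for the Mac.
converted = sent.decode("iso-8859-1").encode("mac_roman")

# Faulty handling: the octet is displayed as if it were already Mac encoded.
misread = sent.decode("mac_roman")

print(sent[0], converted[0], misread)
```

Octet 196 maps to octet 128 when converted properly; without the conversion, the reader sees the florin sign (U+0192) instead of Ä.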

Unfortunately, there are deficiencies and errors in software so that users often have to struggle with character code conversion problems, perhaps correcting the actions taken by programs. It takes two to tango, and some more participants to get characters right. This section demonstrates different things which may happen, and do happen, when just one component is faulty, i.e. when MIME is not used or is inadequately supported by some "partner" (software involved in entering, storing, transferring, and displaying character data).

A typical minor (!) problem which may occur in communication in Western European languages other than English is that most characters get interpreted and displayed correctly but some "national letters" don't. For example, the character repertoire needed in German, Swedish, and Finnish is essentially ASCII plus a few letters like "ä" from the rest of ISO Latin 1. If a text in such a language is processed so that a necessary conversion is not applied, or an incorrect conversion is applied, the result might be that e.g. the word "später" becomes "spter" or "spÌter" or "spdter" or "sp=E4ter".

Sometimes you might be able to guess what has happened, and perhaps to determine which code conversion should be applied, and apply it more or less "by hand". To take an example (which may have some practical value in itself to people using the languages mentioned): Assume that you have some text data which is expected to be, say, in German, Swedish or Finnish and which appears to be such text with some characters replaced by oddities in a somewhat systematic way. Locate some words which probably should contain "ä" but have something strange in place of it (see examples above). Assume further that the program you are using interprets text data according to ISO 8859-1 by default and that the actual data is not accompanied with a suitable indication (like a Content-Type header) of the encoding, or such an indication is obviously in error. Now, looking at what appears instead of "ä", we might guess:

a
The person who wrote the text presumably just used "a" instead of "ä", probably because he thought that "ä" would not get through correctly. Although "ä" is surely problematic, the cure usually is worse than the disease: using "a" instead of "ä" loses information and may change the meanings of words. This usage, and the next two usages below, is (usually) not directly caused by incorrect implementations but by the human writer; however, it is indirectly caused by them.
ae
Similarly to the above-mentioned case, this is usually an attempt to avoid writing "ä". For some languages (e.g. German), using "ae" as a surrogate for "ä" works to some extent, but it is much less applicable to Swedish or Finnish - and loses information, since the letter pair "ae" can genuinely occur in many words.
a"
Yet another surrogate. It resembles an old (and generally outdated) but it is probably expected to be understood by humans instead of being converted to an "ä" by a program.
d
The original data was actually ISO 8859-1 encoded or something similar (e.g. ) but during data transfer the most significant bit of each octet was lost. (Such things may happen in systems for transferring, or "gatewaying", data from one network to another. Sometimes it might be your terminal that has been configured to "mask out" the most significant bit!) This means that the octet representing "ä" in ISO 8859-1, i.e. 228, became 228 - 128 = 100, which is the ISO 8859-1 encoding of letter d.
{
Obviously, the data is in a national variant of ASCII so that the character "{" is used in place of "ä". Earlier it was common to use various national variants of ASCII, with characters #$@[\]^_`{|}~ replaced by national characters according to the needs of a particular language. Thus they modified the character repertoire of ASCII by dropping out some special characters and introducing national characters into their ASCII code positions. It requires further study to determine the actual encoding used, since e.g. Swedish, German and Finnish ASCII variants all have "ä" as a replacement for "{", but there are differences in other replacements.
Ã¤
The data is evidently in UTF-8 encoding. Notice that the characters Ã and ¤ stand here for octets 195 and 164, which might be displayed differently depending on program and device used.
+AOQ-
The data is in UTF-7 encoding.
Ì
The data is most probably in Roman-8 encoding (defined by Hewlett-Packard).
=E4
The data is in Quoted-Printable encoding. The original encoding, upon which the QP encoding was applied, might be ISO 8859-1, or any other encoding which represents character "ä" in the same way as ISO 8859-1 (i.e. as octet 228 decimal, E4 hexadecimal).
&auml;
The data is in HTML format; the encoding may vary. The notation &auml; is a so-called entity reference.
&#228;
The data is in HTML format; the encoding may vary. The notation &#228; is a so-called character reference. (Notice that 228 is the code position for ä in Unicode.)
‰ (per mille sign, 0/00)
This character occupies code position 228 in the Macintosh character code. Thus, what has probably happened is that some program has received some ISO 8859-1 encoded data and interpreted it as if it were in Mac encoding, then performed some conversion based on that interpretation. Since the per mille sign is not an ISO 8859-1 character, your program is actually not applying ISO 8859-1 interpretation. Perhaps an erroneous conversion turned 228 into 137, which is the code position of the per mille sign in the Windows character code. Windows programs usually interpret data according to that code even when they are said to apply ISO 8859-1.
Σ (capital sigma)
This character occupies code position 228 in DOS code page 437. Since capital sigma is not an ISO 8859-1 character, your program is actually not applying ISO 8859-1 interpretation, for some reason. Perhaps it is interpreting the data according to DOS CP 437, or perhaps the data had been incorrectly converted to some encoding where sigma has a presentation.
nothing
Perhaps the data was encoded in DOS encoding (e.g. code page 850), where the code for "ä" is 132. In ISO 8859-1, octet 132 is in the area reserved for control characters; typically such octets are not displayed at all, or perhaps displayed as blank. If you can access the data in binary form, you could find evidence for this hypothesis by noticing that octets 132 actually appear there. (For instance, the Emacs editor would display such an octet as \204, since 204 is the octal notation for 132.) If, on the other hand, it's not octet 132 but octet 138, then the data is most probably in Macintosh encoding.
„ (double low-9 quotation mark)
Most probably the data was encoded in DOS encoding (e.g. code page 850), where the code for "ä" is 132. Your program is not actually interpreting the data as ISO 8859-1 encoded but according to the so-called Windows character code, where this code position is occupied by the double low-9 quotation mark.
Š (capital S with caron)
Most probably the data was encoded in Macintosh encoding, where the code for "ä" is 138. Your program is not actually interpreting the data as ISO 8859-1 encoded but according to the so-called Windows character code, where this code position is occupied by the capital S with caron.
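Several of the cases in this list can be reproduced by encoding "ä" one way and decoding the result another way. The following sketch does that with Python's built-in codecs; the helper `seen_as` and the choice of codec names (`cp850` for DOS, `mac_roman` for Macintosh, `cp1252` for the Windows code) are assumptions made for the illustration.

```python
def seen_as(text, actual, assumed):
    """Encode text with the actual encoding, then decode it as the assumed one."""
    return text.encode(actual).decode(assumed)


cases = {
    "UTF-8 data read as ISO 8859-1": seen_as("ä", "utf-8", "iso-8859-1"),
    "UTF-7 data read as ASCII": seen_as("ä", "utf-7", "ascii"),
    "ISO 8859-1 data read as Mac": seen_as("ä", "iso-8859-1", "mac_roman"),
    "ISO 8859-1 data read as CP 437": seen_as("ä", "iso-8859-1", "cp437"),
    "DOS data read as Windows": seen_as("ä", "cp850", "cp1252"),
    "Mac data read as Windows": seen_as("ä", "mac_roman", "cp1252"),
}

# Loss of the most significant bit: 228 becomes 100, the code of "d".
stripped = chr("ä".encode("iso-8859-1")[0] & 0x7F)

for description, result in cases.items():
    print(f"{description}: {result}")
print("High bit stripped:", stripped)
```

Running the same idea in reverse (trying candidate pairs until a damaged word comes out right) is one way to do the "by hand" diagnosis described above, though it remains guesswork.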

To illustrate what may happen when text is sent in a grossly invalid form, consider the following example. I'm sending myself E-mail, using Netscape 4.0 (on Windows 95). In the mail composition window, I set the encoding to UTF-8. The body of my message is simply
Tämä on testi.
(That's Finnish for 'This is a test'. The second and fourth characters are letter a with umlaut.) Trying to read the mail on my Unix account, using the Pine E-mail program (popular among Unix users), I see the following (when in "full headers" mode; irrelevant headers omitted here):

X-Mailer: Mozilla 4.0 [en] (Win95; I)
MIME-Version: 1.0
To: jkorpela@cs.tut.fi
Subject: Test
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=x-UNICODE-2-0-UTF-7
Content-Transfer-Encoding: 7bit

[The following text is in the "x-UNICODE-2-0-UTF-7" character set]
[Your display is set for the "ISO-8859-1" character set]
[Some characters may be displayed incorrectly]

T+O6Q- on testi.

Interesting, isn't it? I specifically requested UTF-8 encoding, but Netscape used UTF-7. And it did not include a correct header, since x-UNICODE-2-0-UTF-7 is not a registered "charset" name. Even if the encoding had been a registered one, there would have been no guarantee that my E-mail program would have been able to handle the encoding. The example, "T+O6Q-" instead of "Tämä", illustrates what may happen when an octet sequence is interpreted according to another encoding than the intended one. In fact, it is difficult to say what Netscape was really doing, since it seems to encode incorrectly.

The "+" and "-" characters correspond to octets indicating a switch to "shifted encoding" and back from it. The shifted encoding is based on presenting Unicode values first as 16-bit binary integers, then regrouping the bits and presenting the resulting six-bit groups as octets according to a table specified in RFC 2045 in the section on Base64. See also RFC 2152.
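The regrouping can be followed step by step for a single character. The sketch below is not a full UTF-7 encoder; the function name `utf7_shift` is made up, and it assumes the padding-free Base64 variant that RFC 2152 describes. It shows what a correct encoding of "Tämä" would look like, for comparison with the "T+O6Q-" that Netscape produced.

```python
import base64


def utf7_shift(ch):
    """Present one character in UTF-7 shifted form: +<base64>-."""
    sixteen_bit = ch.encode("utf-16-be")  # the value as a 16-bit integer
    # Regroup the bits into six-bit groups and map them via the Base64 table;
    # the "=" padding of ordinary Base64 is dropped.
    groups = base64.b64encode(sixteen_bit).decode("ascii").rstrip("=")
    return "+" + groups + "-"


shifted = utf7_shift("ä")
expected = "T" + shifted + "m" + shifted

# Check the hand-built string against Python's own UTF-7 decoder.
roundtrip = expected.encode("ascii").decode("utf-7")

print(shifted, expected, roundtrip)
```

The ASCII letters T and m are sent directly, and only the two occurrences of ä are shifted.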

Whenever text data is sent over a network, the sender and the recipient should have a joint agreement on the character encoding used. In the optimal case, this is handled by the software automatically, but in reality the users need to take some precautions.

Most importantly, make sure that any Internet-related software that you use to send data specifies the encoding correctly in suitable headers. There are two things involved: the header must be there and it must reflect the actual encoding used; and the encoding used must be one that is widely understood by the (potential) recipients' software. One must often make compromises as regards the latter aim: you may need to use an encoding which is not yet widely supported to get your message through at all.

It is useful to find out how to make your Web browser, newsreader, and E-mail program display the encoding information for the page, article, or message you are reading. (For example, on Netscape use View Page Info; on News Xpress, use View Raw Format; on Pine, use h.)

If you use, say, Netscape to send E-mail or to post to Usenet news, make sure it sends the message in a reasonable form. In particular, make sure it does not send the message as HTML or duplicate it by sending it both as plain text and as HTML (select plain text only). As regards character encoding, make sure it is something widely understood, such as ASCII, some ISO 8859 encoding, or UTF-8, depending on how large a character repertoire you need.

In particular, avoid sending data in a proprietary encoding (like the or a ) to a public network. At the very least, if you do that, make sure that the message header specifies the encoding! There's nothing wrong with using such an encoding within a single computer or in data transfer between similar computers. But when sent to the Internet, data should be converted to a more widely known encoding, by the sending program. If you cannot find a way to configure your program to do that, get another program.

As regards other forms of transfer of data in digital form, such as diskettes, information about encoding is important, too. The problem is typically handled by guesswork. Often the crucial thing is to know which program was used to generate the data, since the text data might be inside a file in, say, the MS Word format which can only be read by (a suitable version of) MS Word or by a program which knows its internal data format. That format, once recognized, might contain information which specifies the character encoding used in the text data included; or it might not, in which case one has to ask the sender, or make a guess, or use trial and error - viewing the data using different encodings until something sensible appears.

Character code problems are part of a topic called internationalization (jocularly abbreviated as i18n), rather misleadingly, because it mainly revolves around the problems of using various languages and writing systems (scripts). (Typically !) It includes difficult questions like text directionality (some languages are written right to left) and requirements to present the same character with different glyphs according to its context. See .

I originally started writing this document as a tutorial for HTML authors. Later I noticed that this general information is extensive enough to be put into a document of its own. As regards HTML specific problems, the document summarizes what currently seems to be the best alternative in the general case.


Acknowledgements

I have learned a lot about character set issues from the following people (listed in an order which is roughly chronological by the start of their influence on my understanding of these things): , , , Roman Czyborra, , Erkki I. Kolehmainen. (But any errors in this document I souped up by myself.)
