Category: Python/Ruby

2016-04-05 17:14:07

Encoding and Decoding -- with the implementation in Python2/3

Text as numbers

Encoding refers to the process of representing information in some form. Human language is an encoding system by which we represent information in terms of sequences of lexical units, and those in terms of sound or gesture sequences. Written language is a derivative system of encoding by which those sequences of lexical units, sounds or gestures are represented in terms of the graphical symbols that make up some writing system.

In computer systems, we encode written language by representing the graphemes or other text elements of the writing system in terms of sequences of characters, units of textual information within some system for representing written texts. These characters are in turn represented within a computer in terms of the only means of representation the computer knows how to work with: binary numbers. A character set encoding (or character encoding) is a system for doing this.

Any character set encoding involves at least these two components: a set of characters and some system for representing these in terms of the processing units used within the computer. (Notice that these correspond to the two levels of representation described in the previous paragraph.) There is, of course, no predetermined way in which this is done. The ASCII standard is one system for doing this, but not the only one, with the result that a number stored in the computer's data can mean different things depending upon the conventions being assumed.
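As a small illustration of that last point, the following Python 3 sketch decodes one stored byte value under three legacy conventions. The byte value 0xE9 and the three codec names are chosen here only for illustration; they are not taken from the discussion above.

```python
# One stored number, interpreted under three different legacy conventions.
raw = bytes([0xE9])

# Codec names as spelled in Python's standard library.
decoded = {name: raw.decode(name) for name in ("cp437", "cp1252", "mac_roman")}

for name, char in decoded.items():
    print(f"{name:10} 0x{raw[0]:02X} -> {char!r}")
```

Each codec yields a different character for the same byte, which is why the encoding of stored text has to be known before the text can be read correctly.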

Some industry standard legacy encodings

The ASCII standard was among the earliest encoding standards, and was minimally adequate for US English text. It was not minimally adequate for British English, however, let alone fully adequate for English-language publishing or for most any other language. Not surprisingly, it did not take long for new standards to proliferate. These have come from two sources: standards bodies, and independent software vendors.

Software vendors have often developed encoding standards to meet the needs of a particular product in relation to a particular market. For example, IBM developed codepage 437 for DOS, taking ASCII and adding characters needed for British English and some major European languages, as well as graphical characters such as line-drawing characters for DOS applications to use in creating user interface elements. Other DOS codepages were created to meet other markets in which other languages or scripts were used. For example, codepage 852 was created for Central European languages that use Latin script, and codepage 855 for Russian and some other Eastern European languages that use Cyrillic script.

Among personal computer vendors, Apple created various standards that differed from IBM and Microsoft standards in order to suit the distinctive graphical nature of the Macintosh product line. Similarly, as Microsoft began development of Windows, the needs of the graphical environment led them to develop new codepages. These are the familiar Windows codepages, such as codepage 1252, alternately known as "Western", "Latin 1" or "ANSI".

Most of these legacy encoding standards encode each character in terms of a single 8-bit processing unit, or byte. Not all hardware architectures have used an 8-bit byte, however; different architectures have used bytes that range anywhere from 6 bits to 36 bits. For virtually all character encoding standards that affect personal computers, however, 8-bit bytes are the norm.

It is not always the case, however, that characters are encoded in terms of just a single 8-bit value. For example, Microsoft codepages for Chinese, Japanese and Korean use so-called double-byte encodings, which use a mixture of one- and two-byte sequences to represent characters. To illustrate, in codepage 950, used for Traditional Chinese, the byte value 0x73 by itself is used to represent LATIN SMALL LETTER S, but certain two-byte sequences that end in 0x73 can represent a different character; for example, the two-byte sequence 0xA4 0x73 is used to represent the Traditional Chinese character '山'.
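The codepage 950 example can be checked with a short Python 3 sketch. It assumes that Python's "cp950" codec matches the Microsoft codepage described here; the variable names are illustrative only.

```python
# The byte 0x73 on its own, and as the trailing byte of a two-byte sequence.
lone = bytes([0x73])
pair = bytes([0xA4, 0x73])

lone_text = lone.decode("cp950")
pair_text = pair.decode("cp950")

print(repr(lone_text), len(lone_text))  # one byte in, one character out
print(repr(pair_text), len(pair_text))  # two bytes in, still one character out
```

The same byte value therefore cannot be interpreted in isolation: whether 0x73 stands for a letter or is part of another character depends on the byte before it.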

Of course, with all these different encoding standards, it is not surprising that in many cases a given standard may offer support for certain characters that others do not. None of these legacy standards is comprehensive. It is also true in many cases that multiple encoding standards support the same character set but in incompatible ways.

Before leaving this discussion of industry standard encodings, it would probably be helpful for me to explain Windows codepages and the term ANSI. When Windows was being developed, the American National Standards Institute (ANSI) was in the process of drafting a standard that eventually became ISO 8859-1 "Latin 1". Microsoft created their codepage 1252 for Western European languages based on an early draft of the ANSI proposal, and began to refer to this as "the ANSI codepage". Codepage 1252 was finalised before ISO 8859-1 was finalised, however, and the two are not the same: codepage 1252 is a superset of ISO 8859-1.
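To see where the two differ, the sketch below compares how Python 3 decodes every byte value under its "cp1252" and "latin-1" codecs. Note one assumption about these codecs: Python's "latin-1" maps bytes 0x80 to 0x9F to control characters, and its "cp1252" table leaves a few byte values unassigned, which the sketch treats as "no character".

```python
# Collect the byte values that the two codecs do not decode identically.
differing = []
for value in range(0x100):
    single = bytes([value])
    as_latin1 = single.decode("latin-1")
    try:
        as_cp1252 = single.decode("cp1252")
    except UnicodeDecodeError:
        as_cp1252 = None  # unassigned in Python's cp1252 table
    if as_cp1252 != as_latin1:
        differing.append(value)

print(f"0x{differing[0]:02X}..0x{differing[-1]:02X}, {len(differing)} values")
print(repr(bytes([0x80]).decode("cp1252")))  # a printable character in cp1252
```

Under these codecs the differences are confined to one block of byte values, where codepage 1252 places additional printable characters; everything else decodes identically.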

Later, apparently around the time of Windows 95 development, Microsoft began to use the term "ANSI" in a different sense to mean any of the Windows codepages, as opposed to Unicode. Therefore, currently in the context of Windows, the terms "ANSI text" or "ANSI codepage" should be understood to mean text that is encoded with any of the legacy 8-bit Windows codepages rather than Unicode. It really should not be used to mean the specific codepage associated with the US version of Windows, which is codepage 1252.

Character set encoding model

A more complete model needed to describe character sets and encodings involves four different levels of representation: the abstract character repertoire, the coded character set, the character encoding form, and the character encoding scheme.

Abstract character repertoire (ACR)

An abstract character repertoire (ACR) is simply an unordered collection of characters to be encoded. In a given standard, the repertoire may be closed, meaning that it is fixed and cannot be added to, or it may be open, meaning that new characters can be added to it over time.

Coded character set (CCS)

The second level in the model is the coded character set (CCS). A CCS is merely a mapping from some repertoire to a set of unique numeric designators. Typically, these are integers, though in some standards they are ordered pairs of integers.

The numeric designator is known as a codepoint, and the combination of an abstract character and its codepoint is known as an encoded character. It is important to note that these codepoints are not tied to any representation in a computer. The codepoints are not bytes; they are simply integers (or pairs of integers). The range of possible codepoints is typically defined in a standard as being limited. The valid range of codepoints in an encoding standard is referred to as the codespace. The collection of encoded characters in a standard is referred to as a codepage.
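In Python 3, the Unicode coded character set can be inspected directly, which gives a feel for what an encoded character is. The sketch below pairs a few sample characters (picked for illustration) with their codepoints and names; no bytes are involved at this level.

```python
import unicodedata

# Sample abstract characters; the selection is arbitrary.
chars = ["s", "\u00e9", "\u5c71"]

# Each encoded character is an (integer codepoint, abstract character) pairing.
encoded = [(ord(c), unicodedata.name(c)) for c in chars]

for codepoint, name in encoded:
    print(f"U+{codepoint:04X}  {name}")
```

The values printed are plain integers in the Unicode codespace. How they end up as bytes is a separate question.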

Before going on to the third level, it is important to note that some industry standards operate at the CCS level. They standardise a character inventory and perhaps a set of names and properties, but they do not standardise the encoded representation of these characters in the computer. This is the case, for example, with several standards used in the Far East, such as GB2312-80 (for Simplified Chinese), CNS 11643 (for Traditional Chinese), JIS X 0208 (for Japanese) and KS X 1001 (for Korean). These standards depend upon separate standards for encoded representation, which is where the next level in our model fits in.

Character encoding form (CEF)

The third level is the character encoding form (CEF). It is in this level that we begin to take into consideration actual representation in a computer. A CEF is a mapping from the codepoints in a CCS to sequences of values of a fixed data type. These values are known as code units. In principle, the code units can be of any size: they might be seven-bit values, 8-bit values, 19-bit values, or whatever. The most commonly encountered sizes for code units are eight, sixteen and thirty-two bits, though other sizes are also used.

In some contexts, a CEF applied to a particular coded character set is referred to as a codepage.

The mapping between codepoints in the CCS and code units in the CEF does not have to be one-to-one. In many cases, an encoding form may map one codepoint to a sequence of multiple code units. This occurs in so-called "double-byte" encodings, like Microsoft codepage 932 or codepage 950. An encoding form also does not have to map characters to code unit sequences of a consistent length. One thing is required of a CEF, however: the mapping for any given codepoint must be a unique code unit sequence.
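The Unicode encoding forms make this concrete. The Python 3 sketch below counts how many code units each of UTF-8, UTF-16 and UTF-32 needs for a single codepoint. The helper name code_unit_counts is hypothetical, and the little-endian codec variants are used only so that no byte order mark is added to the output.

```python
def code_unit_counts(char):
    """Return the number of code units per encoding form for one codepoint."""
    return {
        "utf-8": len(char.encode("utf-8")),            # 8-bit code units
        "utf-16": len(char.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(char.encode("utf-32-le")) // 4,  # 32-bit code units
    }

for sample in ("A", "\u5c71", "\U0001F600"):
    print(f"U+{ord(sample):04X}", code_unit_counts(sample))
```

UTF-8 and UTF-16 map some codepoints to longer code unit sequences than others, while UTF-32 always uses exactly one code unit. In all three, each codepoint has exactly one code unit sequence.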

Some coded character sets are generally used with only one encoding form. For all common legacy character sets other than those used for the Far East, the codespace fits within a single-byte range, and so the encoded representation can easily be made identical in value to the codepoint. Many would not have any incentive to look for another encoding form. Among East Asian standards, the Big Five character set (for Traditional Chinese) is generally encoded using the Big Five encoding.

Similarly, some encoding forms are used only with certain character sets, as is the case with Big Five encoding, or with the UTF-8 encoding form of Unicode.

On the other hand, some character sets are often encoded in various encoding forms. For example, the GB2312-80 character set can be encoded using the GBK encoding, using ISO 2022 encoding, or using EUC encoding. Also, some encoding forms have been applied to multiple character sets. For example, there are variants of EUC encoding that correspond to the GB2312-80 character set, CNS 11643-1992, JIS X 0208, and several other character sets.
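A rough Python 3 illustration of one character set under two encoding forms follows. It relies on two assumptions: that Python's "gb2312" codec implements the EUC form of GB2312-80, and that its "hz" codec is a 7-bit form of the same character set. HZ is not one of the encodings named above; it is used here because it ships with Python's standard library.

```python
# Two characters from the GB2312-80 repertoire.
text = "\u4e2d\u6587"

euc_bytes = text.encode("gb2312")  # EUC form: every byte has the high bit set
hz_bytes = text.encode("hz")       # HZ form: 7-bit bytes with escape sequences

print(euc_bytes.hex(" "))
print(hz_bytes)

# Both byte sequences decode back to the same characters.
print(euc_bytes.decode("gb2312") == hz_bytes.decode("hz") == text)
```

The characters and their designators in the character set are the same in both cases; only the byte-level representation differs.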

Character encoding scheme (CES)

The last level in the model is the character encoding scheme (CES). When 16-bit or 32-bit data units are brought into an 8-bit byte context, the data units can easily be split into 8-bit chunks since their size is an integer multiple of 8 bits. There are two logical possibilities for the order in which those chunks can be sequenced, however: little-endian, meaning the low-order byte comes first; and big-endian, meaning the high-order byte comes first. A character encoding scheme simply specifies which byte sequencing order is used for the given encoding form.
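The two byte orders can be seen with Python 3's explicit UTF-16 codecs. This is a minimal sketch; the sample character is the one used in the codepage 950 discussion earlier, whose codepoint is U+5C71.

```python
char = "\u5c71"

# The same single 16-bit code unit, serialised in the two possible byte orders.
little = char.encode("utf-16-le")  # low-order byte first
big = char.encode("utf-16-be")     # high-order byte first

print(little.hex(" "))
print(big.hex(" "))
```

The plain "utf-16" codec, by contrast, picks a byte order itself and records the choice by prefixing a byte order mark, so its output is two bytes longer.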

Implementation in Python

Encoding is the process of translating a string of characters into its raw bytes form, according to a desired encoding name. Decoding is the process of translating a raw string of bytes into its character string form, according to its encoding name. That is, we encode from string to raw bytes, and decode from raw bytes to string. To scripts, decoded strings are just characters in memory, but may be encoded into a variety of byte string representations when stored on files, transferred over networks, embedded in documents and databases, and so on.
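A minimal Python 3 sketch of this round trip is shown below. The sample text and the two encoding names are chosen for illustration.

```python
text = "na\u00efve caf\u00e9"  # 10 characters, two of them non-ASCII

# Encoding: character string -> raw bytes, according to an encoding name.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding: raw bytes -> character string, using the matching encoding name.
print(utf8_bytes.decode("utf-8") == text)
print(latin1_bytes.decode("latin-1") == text)

# Decoding with the wrong name does not fail here, but yields different text.
garbled = utf8_bytes.decode("latin-1")
print(garbled == text)
```

The decoded string is the same whichever byte representation it came from, provided the bytes are decoded with the encoding that produced them.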

In memory, Python always stores decoded text strings in an encoding-neutral format, which may or may not use multiple bytes for each character. Through Python 3.2, strings are stored internally in fixed-length UTF-16 (roughly, UCS-2) format with 2 bytes per character, unless Python is configured to use 4 bytes per character (UCS-4). Python 3.3 and later instead use a variable-length scheme with 1, 2 or 4 bytes per character, depending on a string's content. The size is chosen based upon the character with the largest Unicode ordinal value in the represented string. This scheme allows a space-efficient representation in the common case, but also allows for full UCS-4 on all platforms.

The key point here, though, is that encoding pertains mostly to files and transfers. Once loaded into a Python string, text in memory has no notion of an "encoding", and is simply a sequence of Unicode characters (a.k.a. code points) stored generically. In your script, that string is accessed as a Python string object.
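The variable-length storage in Python 3.3 and later can be observed, roughly, by measuring how a string's memory footprint grows with its length. The sketch below assumes CPython 3.3 or later, since sys.getsizeof reports implementation-specific sizes; the helper name bytes_per_char is hypothetical.

```python
import sys

def bytes_per_char(char, n=1000):
    """Estimate storage per character from the size growth of a repeated string."""
    # Subtracting the two sizes cancels out the fixed object header.
    return (sys.getsizeof(char * (2 * n)) - sys.getsizeof(char * n)) // n

for sample in ("a", "\u5c71", "\U0001F600"):
    print(f"U+{ord(sample):04X}", bytes_per_char(sample))
```

Other Python implementations may store strings differently, so the figures should be read as an observation about CPython rather than a language guarantee.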

Formally, to code non-ASCII characters, we can use:

  • Hex or Unicode escapes to embed Unicode code point ordinal values in text strings -- normal string literals in 3.X, and Unicode string literals in 2.X (and in 3.3 for compatibility).
  • Hex escapes to embed the encoded representation of characters in byte strings -- normal string literals in 2.X, and bytes string literals in 3.X (and in 2.X for compatibility).

Note that text strings embed actual code point values, while byte strings embed their encoded form (code unit values). The value of a character's encoded representation in a byte string is the same as its decoded Unicode code point value in a text string for only certain characters and encodings (such as ASCII, Latin 1). In any event, hex escapes are limited to coding a single byte's value, but Unicode escapes can name characters with values 2 and 4 bytes wide. The chr function can also be used to create a single non-ASCII character from its code point value, and source code declarations apply to such characters embedded in your script. Conversely, the ord function turns a character into its Unicode code point value, without any involvement of the encoded format.
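These literal forms can be tried out as follows. This is a sketch for Python 3 only (in 2.X the text literals would need a u prefix), and the variable names are illustrative.

```python
# Text strings: escapes name code point values.
text_hex = "\xe9"          # hex escape, limited to one byte's worth of value
text_u16 = "\u00e9"        # Unicode escape, 2 bytes wide
text_u32 = "\U000000e9"    # Unicode escape, 4 bytes wide
print(text_hex == text_u16 == text_u32 == chr(0xE9))

# Byte strings: hex escapes name the bytes of an encoded form.
utf8_form = b"\xc3\xa9"    # the same character encoded as UTF-8
latin1_form = b"\xe9"      # the same character encoded as Latin 1
print(utf8_form.decode("utf-8") == latin1_form.decode("latin-1") == text_hex)

# Only under Latin 1 does the encoded byte value equal the code point value.
print(latin1_form[0] == ord(text_hex), list(utf8_form))
```

chr and ord convert between a character and its code point value directly, so neither involves an encoding.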
