The quoted-printable encoding in RFC 2045 (MIME Part One: Format of Internet Message Bodies)
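Parts of this post are excerpted from:
SMTP only supports 7-bit ASCII (with the 8BITMIME extension it can also carry 8-bit data), so Content-Transfer-Encoding has to offer ways to convert 8-bit data into 7-bit form. quoted-printable is one of them; base64 is another. The essentials are as follows.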
For example, a message may carry the following headers:
Content-type: text/plain; charset=ISO-8859-1
Content-transfer-encoding: base64
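This means that the message is a text message whose original text is in ISO-8859-1, an 8-bit character set. For transmission it was base64-encoded into 7-bit form, and decoding the body should yield the original ISO-8859-1 text again.
Content-transfer-encoding can take the following values: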
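The two-step decoding described above (undo the transfer encoding first, then apply the charset) can be sketched in Python. The sample text "café" is an assumption chosen for illustration; it is not taken from any real message.

```python
import base64

# Hypothetical sample: "café" in ISO-8859-1, where "é" is the single octet 0xE9.
original_text = "café"
raw_octets = original_text.encode("iso-8859-1")   # 8-bit data: b'caf\xe9'

# Sender side: base64 output consists of 7-bit ASCII characters only.
body = base64.b64encode(raw_octets)               # b'Y2Fm6Q=='

# Receiver side: undo the transfer encoding, then interpret the octets
# with the charset named in the Content-type header.
decoded_octets = base64.b64decode(body)
decoded_text = decoded_octets.decode("iso-8859-1")

print(body, decoded_text)
```

The point of the sketch is that Content-transfer-encoding and charset are independent layers: the first only makes the octets safe for transport, the second says what those octets mean.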
1. 7bit: up to 998 octets per line of the code range 1..127 with CR and LF
(codes 13 and 10 respectively) only allowed to appear as part of a CRLF
line ending. This is the default value.
2. 8bit: up to 998 octets per line with CR and LF (codes 13 and 10 respectively) only allowed to appear as part of a CRLF line ending.
Suitable only for use with SMTP servers that support the 8BITMIME SMTP extension.
3. binary: any sequence of octets. Unlike 8bit, there is no limit of 998 octets per line.
Suitable only for use with SMTP servers that support the BINARYMIME SMTP extension.
4. base64: used to encode arbitrary octet sequences into a form that satisfies
the rules of 7bit. Designed to be efficient for non-text 8 bit data.
Sometimes used for text data that frequently uses non-US-ASCII
characters.
5. quoted-printable: used to encode arbitrary octet sequences into a form that satisfies
the rules of 7bit. Designed to be efficient and mostly human readable
when used for text data consisting primarily of US-ASCII characters but
also containing a small proportion of bytes with values outside that
range.
Note that '7bit', '8bit', and 'binary' mean that no binary-to-text
encoding on top of the original encoding was used. In these cases, the
header is actually redundant for the email client to decode the message
body, but it may still be useful as an indicator of what type of object
is being sent. Values 'quoted-printable' and 'base64'
tell the email client that a binary-to-text encoding scheme was used
and that appropriate initial decoding is necessary before the message
can be read with its original encoding (e.g. UTF-8).
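In other words, values 1, 2 and 3 all say that the data in the message has not been transfer-encoded: it already is 7bit, 8bit or binary as it stands.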
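As a rough illustration of what "satisfies the rules of 7bit" means, here is a minimal Python check. The function name `satisfies_7bit` is hypothetical, and the sketch assumes the 998-octet limit is counted without the trailing CRLF.

```python
def satisfies_7bit(data: bytes, max_line: int = 998) -> bool:
    """Return True if data obeys the 7bit rules listed above (a sketch)."""
    # Every octet must lie in the range 1..127: no NUL, no high bit.
    if any(b == 0 or b > 127 for b in data):
        return False
    # After splitting on CRLF, no line may still contain a bare CR or LF,
    # and no line may be longer than max_line octets.
    for line in data.split(b"\r\n"):
        if b"\r" in line or b"\n" in line:
            return False
        if len(line) > max_line:
            return False
    return True


print(satisfies_7bit(b"hello\r\nworld\r\n"))  # plain ASCII with CRLF line endings
print(satisfies_7bit(b"caf\xe9"))             # contains an 8-bit octet
```

Data that fails this check is what base64 or quoted-printable is for; dropping the first test gives a comparable check for 8bit.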
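The quoted-printable rules are as follows:
(1) An octet is represented by "=" followed by two uppercase hexadecimal digits; for example, =3D represents the octet with ASCII value 61.
All characters except printable ASCII characters or end of line characters must be encoded in this fashion. A CRLF that is a line break cannot be represented this way. The equals sign itself, i.e. 61, is written as =3D.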
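(2) All printable ASCII characters (decimal values between 33 and 126) may be represented by themselves, except "=" (decimal 61). That is, every printable ASCII character other than "=" can be left unencoded. Is it also legal to encode them in the form of (1)? Yes: RFC 2045 allows any octet, apart from the CR and LF of a line break, to be written as =XX.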
(3) ASCII tab and space characters, decimal values 9 and 32, may be represented by
themselves, except if these characters appear at the end of a line. If
one of these characters appears at the end of a line it must be encoded
as "=09" (tab) or "=20" (space). So a tab or space at the end of a line must be encoded in the form of (1); anywhere else it can be left unencoded.
(4)If the data being encoded contains meaningful line breaks, they must be
encoded as an ASCII CR LF sequence, not as their original byte values.
Conversely if byte values 10 and 13 have meanings other than end of
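line then they must be encoded as =0A and =0D.
If the data contains line breaks (which differ between environments), they should be encoded as the ASCII CRLF sequence. Does that mean literal CR and LF characters, or the sequence =0D=0A? It is the former: RFC 2045 requires a line break to appear as a literal CRLF in the encoded output. If octets with value 10 or 13 occur in the original data without meaning a line break, they should be encoded as =0A or =0D.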
(5)Lines of quoted-printable encoded data must not be longer than 76
characters. To satisfy this requirement without altering the encoded
text, soft line breaks may be added as desired. A soft line
break consists of an "=" at the end of an encoded line, and does not
cause a line break in the decoded text.
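So an encoded line must not be longer than 76 characters; where it would be, a soft line break is inserted. The soft line break is added by the encoder and must be removed again when decoding. Its form is an "=" as the last character of the line, followed immediately by the line break.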
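Rules (1) to (5) can be put together in a small Python sketch. The names `qp_encode` and `qp_decode` are hypothetical; the encoder assumes that meaningful line breaks in the input are already CRLF, and the decoder assumes well-formed input (it does not strip stray trailing whitespace or tolerate malformed "=" sequences). The encoder is slightly conservative: it starts a new line once 75 characters are reached, so that a soft-broken line including its "=" never exceeds 76.

```python
def qp_encode(data: bytes, max_len: int = 76) -> bytes:
    """Quoted-printable encode data following rules (1)-(5) (a sketch)."""
    out_lines = []
    # Rule (4): meaningful line breaks stay literal CRLF in the output.
    for line in data.split(b"\r\n"):
        current = ""
        last = len(line) - 1
        for i, octet in enumerate(line):
            if 33 <= octet <= 126 and octet != 61:
                token = chr(octet)            # rule (2): printable, not "="
            elif octet in (9, 32) and i != last:
                token = chr(octet)            # rule (3): tab/space, not at end
            else:
                token = f"={octet:02X}"       # rule (1): =XX, uppercase hex
            # Rule (5): keep one column free for the soft line break "=".
            if len(current) + len(token) > max_len - 1:
                out_lines.append(current + "=")
                current = ""
            current += token
        out_lines.append(current)
    return "\r\n".join(out_lines).encode("ascii")


def qp_decode(encoded: bytes) -> bytes:
    """Inverse of qp_encode for well-formed input (a sketch)."""
    out = bytearray()
    lines = encoded.split(b"\r\n")
    for n, line in enumerate(lines):
        soft = line.endswith(b"=")            # soft line break marker
        if soft:
            line = line[:-1]
        i = 0
        while i < len(line):
            if line[i] == 61:                 # "=" introduces two hex digits
                out.append(int(line[i + 1:i + 3], 16))
                i += 3
            else:
                out.append(line[i])
                i += 1
        # A hard line break is restored; a soft one is simply dropped.
        if not soft and n != len(lines) - 1:
            out += b"\r\n"
    return bytes(out)


sample = "café = 1 \r\nnext\tline\t".encode("iso-8859-1")
print(qp_encode(sample))
```

For real mail handling, Python's standard library already provides the `quopri` module and the `email` package; the sketch above is only meant to make the five rules concrete.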