Python Unicode: Encode and Decode Strings (laoliulaoliu, ChinaUnix blog)

Python Unicode: Overview

In order to figure out what “encoding” and “decoding” is all about, let’s look at an example string:

		
    >>> s = "Flügel"

We can see our string s has a non-ASCII character in it, namely “ü” or “umlaut-u”. Assuming we’re in the standard Python 2.x interactive mode, let’s see what happens when we reference the string and when we print it:

				
    >>> s
    'Fl\xfcgel'
    >>> print(s)
    Flügel

Printing gave us the value that we assigned to the variable, but referencing s showed something seemingly incomprehensible instead of what we typed into the interpreter. Behind the scenes, a set of rules translated the non-ASCII character ü into a code phrase, namely “\xfc”. In other words, it was encoded.
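To make "a set of rules" concrete, here is a minimal sketch written for Python 3 (an assumption; the session above is Python 2, where the bytes stored in s depend on the terminal's encoding). It shows that the same text becomes different bytes under different encodings. The “\xfc” seen above is consistent with a Latin-1 style encoding; the variable names are illustrative only.

```python
# Python 3 sketch: the same text yields different bytes under different encodings.
# "\u00fc" is the escape for "ü", used so the example does not depend on the
# encoding of this source file.
text = "Fl\u00fcgel"

latin1_bytes = text.encode("latin-1")  # "ü" becomes the single byte 0xfc
utf8_bytes = text.encode("utf-8")      # "ü" becomes the two bytes 0xc3 0xbc

print(latin1_bytes)  # b'Fl\xfcgel'
print(utf8_bytes)    # b'Fl\xc3\xbcgel'
```

Only the Latin-1 result matches the 'Fl\xfcgel' shown by the interpreter above, which is why the choice of encoding matters whenever text is turned into bytes.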

At this point, s is an 8-bit string, which to us basically means it isn’t a Unicode string. Let’s examine how to make a Unicode string with the same data. The simplest way is with a “u” prefix in front of the literal string marking it as a Unicode string:

					
    >>> u = u"Flügel"

If we reference and print u like we did with s, we’ll find something similar:

			
    >>> u
    u'Fl\xfcgel'
    >>> print(u)
    Flügel
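The escape in the Unicode string looks the same as the one in the 8-bit string, but it describes a different thing: a code point rather than a byte. The following sketch assumes Python 3, where every str is a Unicode string, and uses the standard unicodedata module to inspect the character.

```python
import unicodedata

# Python 3 sketch: a text string stores code points, not encoded bytes.
text = "Fl\u00fcgel"

code_point = ord(text[2])                 # numeric code point of "ü"
char_name = unicodedata.name(text[2])     # official Unicode character name
latin1_byte = text.encode("latin-1")[2]   # byte value of "ü" in Latin-1

print(hex(code_point))  # 0xfc
print(char_name)        # LATIN SMALL LETTER U WITH DIAERESIS
print(latin1_byte)      # 252
```

The code point U+00FC and the Latin-1 byte 0xfc happen to share the same number, because Latin-1 maps each of its 256 byte values to the Unicode code point with the same value. That coincidence is what makes the two displays look alike.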

We can see that the code phrase for our “umlaut-u” is still “\xfc” and it prints the same, so does that mean our Unicode string is encoded the same way as our 8-bit string s? To figure that out, let’s look at what the encode method does when we try it on u and s:

			
    >>> u.encode('latin_1')
    'Fl\xfcgel'
    >>> s.encode('latin_1')
    Traceback (most recent call last):
      File "", line 1, in <module>
        s.encode('latin_1')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 2: ordinal not in range(128)
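The traceback reports a decode error from the 'ascii' codec even though we asked to encode with 'latin_1'. In Python 2, calling encode on an 8-bit string first converts it to Unicode implicitly using the default 'ascii' codec, and that hidden step is what fails. The sketch below assumes Python 3, where the step has to be written out explicitly on a bytes object; raw and the other names are illustrative.

```python
# Python 3 sketch: spell out the implicit ASCII decode that Python 2 performs
# before encoding an 8-bit string.
raw = b"Fl\xfcgel"

error_message = None
try:
    raw.decode("ascii")  # 0xfc is outside the ASCII range 0..127
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)

# Naming the real encoding of the bytes makes the round trip work.
recoded = raw.decode("latin-1").encode("latin-1")
print(recoded == raw)  # True
```

The printed message names the same codec, byte, and position as the Python 2 traceback above.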

Now it seems encoding the Unicode string (with the ‘latin-1’ encoding) returned the same value as string s, but the encode method didn’t work on string s. Since we couldn’t encode s, what about decoding it? Will it give u
