Category: Python/Ruby

2016-09-17 13:54:16

Original source:
This article covers Unicode in Python 2.x. If you want to learn about Unicode in Python 3.x, be sure to check out our article on that topic. Also, if you're interested in checking whether a Unicode string is a number, be sure to check out our article on that as well.

Strings are among the most commonly used data types in Python, and there might be times when you want to (or have to) work with strings containing or entirely made up of characters outside of the standard ASCII set (e.g. characters with accents or other markings).

Python 2.x provides a data type called a Unicode string for working with Unicode data, along with string encoding and decoding methods for converting to and from ordinary 8-bit strings. If you want to learn more about Unicode strings, be sure to check out Wikipedia's article on Unicode.

Note: When executing a Python 2.x script whose source contains non-ASCII characters, you must put an encoding declaration such as # -*- coding: utf-8 -*- on the first or second line of the script, to tell Python that the source file is UTF-8 encoded.
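As a minimal sketch, here is what the top of such a script could look like. The variable name greeting and the word Flügel are illustrative assumptions, not something the article prescribes.

```python
# -*- coding: utf-8 -*-
# The declaration above has to sit on the first or second line of the file.
# Without it, Python 2 raises a SyntaxError on the non-ASCII literal below.

greeting = u'Flügel'  # hypothetical example value

# The literal holds six characters; the third one is code point U+00FC.
print(len(greeting))
print(hex(ord(greeting[2])))
```

The file itself also has to be saved as UTF-8, otherwise the declaration does not match the bytes on disk.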



Python Unicode: Overview

To figure out what “encoding” and “decoding” are all about, let’s look at an example string:
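A minimal sketch of such a string, assuming the German word Flügel as the example value (the name s comes from the text; the value is an assumption). It is spelled with a byte escape and a b prefix, which Python 2 ignores, so that the same line also runs on Python 3 and does not depend on the terminal's encoding.

```python
# Assumed example value: "Fl", u-umlaut, "gel" as Latin-1 bytes.
# Typing the word directly at a Python 2 prompt on a Latin-1 terminal
# would store these same six bytes.
s = b'Fl\xfcgel'

# bytearray yields integer byte values on both Python 2 and Python 3.
byte_values = list(bytearray(s))
print(byte_values)
```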



We can see our string s has a non-ASCII character in it, namely “ü” (“umlaut-u”). Assuming we’re in the standard Python 2.x interactive mode, let’s see what happens when we reference the string, and when we print it:
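A hedged sketch of the two views, again assuming the Flügel example stored as Latin-1 bytes. Referencing a name at the prompt displays repr() of the value, while print hands the raw bytes to the terminal, which is why the two look different.

```python
s = b'Fl\xfcgel'  # assumed example value (Latin-1 bytes)

# Referencing s at the prompt echoes repr(s), with non-ASCII bytes escaped.
# Python 2 shows 'Fl\xfcgel'; Python 3 shows b'Fl\xfcgel'.
echoed = repr(s)
print(echoed)

# Python 2's "print s" sends the six bytes to the terminal unchanged, and a
# Latin-1 terminal draws byte 0xfc as u-umlaut. Decoding as Latin-1 is the
# explicit, portable form of that step.
rendered = s.decode('latin-1')
print(rendered == u'Fl\xfcgel')
```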



Printing gave us the value that we assigned to the variable, but something obviously happened along the way that turned what we typed into the interpreter into something seemingly incomprehensible. The non-ASCII character ü was translated into an escape code, “\xfc”, by a set of rules behind the scenes. In other words, it was encoded. The value “\xfc” is what a single-byte encoding such as Latin-1 produces for ü; on a UTF-8 terminal the same character would be stored as the two bytes “\xc3\xbc”.

At this point, s is an 8-bit string, which to us basically means it isn’t a Unicode string. Let’s examine how to make a Unicode string with the same data. The simplest way is with a “u” prefix in front of the literal string marking it as a Unicode string:
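For illustration, a minimal sketch of the prefixed literal, assuming the same Flügel example. The \xfc escape is used instead of typing ü directly so the snippet does not depend on the terminal's or the file's encoding.

```python
# The u prefix makes this a Unicode string. Here \xfc names the code point
# U+00FC (u-umlaut) rather than a raw byte.
u = u'Fl\xfcgel'

# The type is called unicode on Python 2 and str on Python 3.
print(type(u).__name__)

# Length counts characters, not bytes.
print(len(u))

# The byte count depends on the encoding chosen later; UTF-8 needs two
# bytes for the u-umlaut.
print(len(u.encode('utf-8')))
```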



If we reference and print u like we did with s, we’ll find something similar:
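A sketch of both views for the Unicode string, under the same Flügel assumption. Python 2 echoes u as u'Fl\xfcgel'; here the escape stands for a code point, and print has to encode the text for the terminal before writing it. The Latin-1 terminal is an assumption.

```python
u = u'Fl\xfcgel'  # assumed example value

# In a Unicode string, \xfc is the code point U+00FC, not a stored byte.
code_point = ord(u[2])
print(hex(code_point))

# Python 2's "print u" encodes the text with the terminal's encoding
# before writing it. Assuming a Latin-1 terminal, these bytes are sent:
terminal_bytes = u.encode('latin-1')
print(list(bytearray(terminal_bytes)))
```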



We can see that the code phrase for our “umlaut-u” is still “\xfc” and that it prints the same, so does that mean our Unicode string is encoded the same way as our 8-bit string s? To figure that out, let’s look at what the encode method does when we try it on u and s:
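A sketch of that experiment, written so it runs on both Python 2 and Python 3, and assuming the Flügel example. On Python 2, s.encode() first tries to decode s with the default ASCII codec, which fails on byte 0xfc; on Python 3 the 8-bit type (bytes) has no encode method at all.

```python
u = u'Fl\xfcgel'  # Unicode string (assumed example value)
s = b'Fl\xfcgel'  # 8-bit string holding the Latin-1 bytes

# Encoding the Unicode string as Latin-1 yields the same bytes as s.
encoded = u.encode('latin-1')
print(encoded == s)

# Encoding the 8-bit string fails, for a version-specific reason.
try:
    s.encode('latin-1')
    outcome = 'no error'
except UnicodeDecodeError:
    # Python 2: the implicit ASCII decode of s fails on byte 0xfc.
    outcome = 'UnicodeDecodeError'
except AttributeError:
    # Python 3: bytes objects have no encode method.
    outcome = 'AttributeError'
print(outcome)
```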



Now it seems encoding the Unicode string (with the ‘latin-1’ encoding) returned the same value as string s, but the encode method didn’t work on string s. Since we couldn’t encode s, what about decoding it? Will it give u
