Category: Python/Ruby
2016-09-17 13:54:16
Strings are among the most commonly used data types in Python, and there might be times when you want to (or have to) work with strings containing or entirely made up of characters outside of the standard ASCII set (e.g. characters with accents or other markings).
Python 2.x provides a data type called a Unicode string for working with Unicode data, along with string encoding and decoding methods. If you want to learn more about Unicode strings, be sure to check out Wikipedia's article on Unicode.
Note: When executing a Python 2.x script that contains non-ASCII characters, you must put the declaration # -*- coding: utf-8 -*- on the first or second line of the script, to tell Python that the source file is encoded as UTF-8.
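A minimal sketch of a script that starts with such a declaration might look like the following. The declaration is the standard PEP 263 form; the variable name and the word "Flügel" are made up for illustration, and the file is assumed to be saved as UTF-8.

```python
# -*- coding: utf-8 -*-
# The declaration above has to appear on the first or second line of the file.
# Python 2 rejects non-ASCII bytes in source code without it; Python 3 already
# assumes UTF-8, so there it is harmless.

word = u'Flügel'  # hypothetical example string containing a non-ASCII character

# Six characters, even though the file stores the "ü" as two UTF-8 bytes.
print(len(word))
```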
In order to figure out what “encoding” and “decoding” are all about, let’s look at an example string:
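The listing itself is not reproduced here, so the snippet below uses a stand-in. The word "Flügel" is an assumption, chosen only because it contains "ü". It is written as a bytes literal with an explicit escape so that it behaves the same whatever encoding the terminal uses (in Python 2.x, a b'...' literal is an ordinary 8-bit str).

```python
# Hypothetical stand-in for the example string; the original value is not shown.
# \xfc is the single byte that the Latin-1 encoding uses for the character "ü".
s = b'Fl\xfcgel'

# One byte per character in this encoding, so the length is 6.
print(len(s))
```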
We can see our string s has a non-ASCII character in it, namely “ü”, or “umlaut-u”. Assuming we’re in the standard Python 2.x interactive mode, let’s see what happens when we reference the string, and when it’s printed:
Printing gave us the value that we assigned to the variable, but something obviously happened along the way that turned it from what we typed into the interpreter into something seemingly incomprehensible. The non-ASCII character ü was translated into a code phrase, i.e. “\xfc”, by a set of rules behind the scenes. In other words, it was encoded.
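To make the "set of rules" a little more concrete, the sketch below encodes the single character ü under two codecs. Latin-1 is assumed to be the encoding behind "\xfc"; UTF-8 is included only for contrast, and the variable names are illustrative.

```python
umlaut = u'\xfc'  # the character "ü", written as an escape

# Latin-1 maps each of the first 256 code points to one byte of the same value.
latin1_bytes = umlaut.encode('latin-1')
# UTF-8 needs two bytes for the same character.
utf8_bytes = umlaut.encode('utf-8')

print(ord(umlaut))        # code point of the character
print(len(latin1_bytes))  # number of bytes under Latin-1
print(len(utf8_bytes))    # number of bytes under UTF-8
```

The same character ends up as different byte sequences depending on the rules (the codec) applied, which is why the encoding always has to be known when reading bytes back.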
At this point, s is an 8-bit string, which to us basically means it isn’t a Unicode string. Let’s examine how to make a Unicode string with the same data. The simplest way is with a “u” prefix in front of the literal string marking it as a Unicode string:
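As a small sketch of the "u" prefix, assuming "Flügel" as a hypothetical example word: the two literals below hold the same text but are different types. The snippet avoids typing "ü" directly so that the result does not depend on the terminal or file encoding.

```python
u = u'Fl\xfcgel'  # Unicode string, thanks to the "u" prefix
s = b'Fl\xfcgel'  # 8-bit string (plain str in Python 2.x, bytes in Python 3)

# The type names differ: unicode vs. str in Python 2.x, str vs. bytes in Python 3.
print(type(u).__name__)
print(type(s).__name__)
```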
If we reference and print u like we did with s, we’ll find something similar:
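The exact output of referencing and printing depends on the Python version and terminal, so the sketch below compares the underlying numbers instead. It assumes the hypothetical word "Flügel" and Latin-1 bytes for the 8-bit string.

```python
u = u'Fl\xfcgel'  # Unicode string (hypothetical example word)
s = b'Fl\xfcgel'  # 8-bit string holding the Latin-1 bytes of the same word

code_points = [ord(ch) for ch in u]  # one integer per character
byte_values = list(bytearray(s))     # one integer per byte

print(code_points)
print(byte_values)
```

Both lists contain 252 (0xfc) at the position of the "ü", which is why the two strings look alike when displayed in escaped form.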
We can see that the code phrase for our “umlaut-u” is still “\xfc” and it prints the same. So does that mean our Unicode string is encoded the same way as our 8-bit string s? To figure that out, let’s look at what the encode method does when we try it on u and s:
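A hedged sketch of that experiment, using the hypothetical word "Flügel" and the 'latin-1' codec. The exception type depends on the interpreter, so the snippet catches both possibilities.

```python
u = u'Fl\xfcgel'  # Unicode string
s = b'Fl\xfcgel'  # 8-bit string

# Encoding the Unicode string yields the same bytes that s already holds.
encoded = u.encode('latin-1')
print(encoded == s)

failure = None
try:
    s.encode('latin-1')
except (UnicodeDecodeError, AttributeError) as exc:
    # Python 2.x: encode() first decodes s implicitly as ASCII and fails on
    # byte 0xfc (UnicodeDecodeError).
    # Python 3: bytes objects have no encode() method (AttributeError).
    failure = type(exc).__name__
print(failure)
```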
Now it seems encoding the Unicode string (with the ‘latin-1’ encoding) returned the same value as string s, but the encode method didn’t work on string s. Since we couldn’t encode s, what about decoding it? Will it give u