How Python handles unicode by default - bailiangcn - ChinaUnix blog


Category: Python/Ruby

2011-02-14 13:42:16

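Two encoding issues are involved here:
1. The encoding used for the file system. This value is returned by sys.getfilesystemencoding().
2. The default encoding that Python's unicode() function uses for decoding. This value is returned by sys.getdefaultencoding().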
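
To see both values on a given machine, a minimal sketch such as the following can be used. The helper name report_encodings is made up for illustration; only the two sys calls come from the text above.

```python
import sys


def report_encodings():
    """Return the (file system encoding, default encoding) pair."""
    # report_encodings is a hypothetical helper, not a standard function.
    fs_encoding = sys.getfilesystemencoding()
    default_encoding = sys.getdefaultencoding()
    return fs_encoding, default_encoding


fs_encoding, default_encoding = report_encodings()
print("file system encoding: %s" % fs_encoding)
print("default encoding: %s" % default_encoding)
```

The values depend on the interpreter version and on the environment it was started in, so no output is shown here.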

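The world of encodings is a very tiresome business. One look at the output of locale -m and I lose my appetite. Concentrating on the problems between UTF-8, unicode and ascii is enough.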

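The importance of locale
It is fair to say that locale has a large effect on how a program behaves. libc on Linux provides a mechanism that makes this kind of problem easier to handle. For example: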
jessinio@jessinio-laptop:/$ export LC_ALL='POSIX'
jessinio@jessinio-laptop:/$ locale
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=POSIX
jessinio@jessinio-laptop:/$ bash

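Chinese cannot be used in this bash environment. Even if you paste it in, bash will not accept it. locale affects how a program handles byte streams (the water is deep here, mostly in the C functions, so I won't dig into it for now).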

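How this started
A while ago I sorted out the time zone problem. Tonight an i18n problem turned up as well. All I can do is stay in and deal with each one as it comes.

This is a perfectly ordinary statement: os.path.exists( path ). It works fine everywhere else, but under mod_wsgi the damn thing fails: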

File "/home/jessinio/data/workspace/project/home/views.py" in index
32.     os.path.exists(path)
File "/usr/lib/python2.6/genericpath.py" in exists
18.         st = os.stat(path)

Exception Type: UnicodeEncodeError at /
Exception Value: ('ascii', u'/tmp/\u6881\u5e86\u559c', 5, 8, 'ordinal not in range(128)')

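os.stat is what fails. Why can the Python interpreter encode the path in some places, yet fail to encode it under mod_wsgi?

I started by looking at how C handles i18n. Environment variables are the way into the problem. Here is some evidence.
Contents of the Python file: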
jessinio@jessinio-laptop:~$ cat /tmp/en.py
# coding: utf-8
import os

s = u'/tmp/梁庆喜'
os.path.exists(s)

# The following shows the effect of the LANG environment variable:
jessinio@jessinio-laptop:~$ env|grep LANG
LANG=en_US.UTF-8
GDM_LANG=en_US.UTF-8
jessinio@jessinio-laptop:~$ python /tmp/en.py

jessinio@jessinio-laptop:~$ export LANG=zh_CN.GBK
jessinio@jessinio-laptop:~$ python /tmp/en.py
Traceback (most recent call last):
  File "/tmp/en.py", line 6, in <module>
    os.path.exists(s)
  File "/usr/lib/python2.6/genericpath.py", line 18, in exists
    st = os.stat(path)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-7: ordinal not in range(128)
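
The error itself can be reproduced without touching the file system, by encoding the same path by hand. This is only a sketch: try_encode is a hypothetical helper, and it shows the failing encode step in isolation, not what mod_wsgi does internally.

```python
# -*- coding: utf-8 -*-
# Sketch only: try_encode is a made-up helper for illustration.
path = u'/tmp/\u6881\u5e86\u559c'  # the same path as in /tmp/en.py


def try_encode(text, encoding):
    """Return (encoded_bytes, None) on success or (None, exception) on failure."""
    try:
        return text.encode(encoding), None
    except UnicodeEncodeError as exc:
        return None, exc


ascii_bytes, ascii_error = try_encode(path, 'ascii')
utf8_bytes, utf8_error = try_encode(path, 'utf-8')

# With 'ascii', the three Chinese characters at positions 5 to 7 cannot be encoded.
print("%s %d %d" % (ascii_error.encoding, ascii_error.start, ascii_error.end))
print(repr(utf8_bytes))
```

The start and end values (5 and 8) are the same numbers that appear in the exception value of the mod_wsgi traceback earlier.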

First, look at what os.stat actually does. The posix_do_stat function in Modules/posixmodule.c reads:

    if (!PyArg_ParseTuple(args, format,
                          Py_FileSystemDefaultEncoding, &path))
        return NULL;
    pathfree = path;

    Py_BEGIN_ALLOW_THREADS
    res = (*statfunc)(path, &st);
    Py_END_ALLOW_THREADS

So where does Py_FileSystemDefaultEncoding come from? The following is from Python/pythonrun.c, a file that is used when Python starts up. This is where Py_FileSystemDefaultEncoding is set:

#if defined(Py_USING_UNICODE) && defined(HAVE_LANGINFO_H) && defined(CODESET)
    /* On Unix, set the file system encoding according to the
       user's preference, if the CODESET names a well-known
       Python codec, and Py_FileSystemDefaultEncoding isn't
       initialized by other means. Also set the encoding of
       stdin and stdout if these are terminals, unless overridden. */

    if (!overridden || !Py_FileSystemDefaultEncoding) {
        saved_locale = strdup(setlocale(LC_CTYPE, NULL));
        setlocale(LC_CTYPE, "");
        loc_codeset = nl_langinfo(CODESET);
        if (loc_codeset && *loc_codeset) {
            PyObject *enc = PyCodec_Encoder(loc_codeset);
            if (enc) {
                loc_codeset = strdup(loc_codeset);
                Py_DECREF(enc);
            } else {
                loc_codeset = NULL;
                PyErr_Clear();
            }
        } else
            loc_codeset = NULL;
        setlocale(LC_CTYPE, saved_locale);
        free(saved_locale);

        if (!overridden) {
            codeset = icodeset = loc_codeset;
            free_codeset = 1;
        }

        /* Initialize Py_FileSystemDefaultEncoding from
           locale even if PYTHONIOENCODING is set. */
        if (!Py_FileSystemDefaultEncoding) {
            Py_FileSystemDefaultEncoding = loc_codeset;
            if (!overridden)
                free_codeset = 0;
        }
    }
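
The same probing steps can be sketched in Python with the locale and codecs modules. This is a rough analogue of the C code above, not the interpreter's own logic: probe_locale_codeset is a hypothetical name, and the sketch assumes a platform where locale.nl_langinfo exists (it does not on Windows).

```python
import codecs
import locale


def probe_locale_codeset():
    """Return the Python codec name for the environment's LC_CTYPE, or None."""
    # probe_locale_codeset is a made-up helper mirroring the C excerpt.
    if not hasattr(locale, "nl_langinfo"):
        return None  # platform without nl_langinfo
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, like setlocale(LC_CTYPE, NULL)
    try:
        try:
            locale.setlocale(locale.LC_CTYPE, "")  # adopt the user's environment
        except locale.Error:
            pass  # unsupported locale: the current locale stays in effect
        codeset = locale.nl_langinfo(locale.CODESET)
        if not codeset:
            return None
        try:
            # Accept the codeset only if it names a codec Python knows.
            return codecs.lookup(codeset).name
        except LookupError:
            return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the saved locale


print(probe_locale_codeset())
```

The result depends on the environment variables the process was started with, which is exactly why the same script behaves differently under different LANG settings.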

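The value of Py_FileSystemDefaultEncoding can also be read from within Python: sys.getfilesystemencoding()

However, there is no set function, and Python's C code does not expose any way to change it either. In other words, once Python has started, this value is fixed. (A bit depressing...)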

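Python provides a module called locale, similar to C's locale functions (it is in fact a wrapper around them), but:
* This library cannot modify Py_FileSystemDefaultEncoding. Do not expect to change Py_FileSystemDefaultEncoding through its functions after Python has started.
* In other words, locale cannot change how Python handles the file system encoding.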
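
A small check of the two points above, as a sketch: it switches LC_CTYPE to the "C" locale through the locale module and compares sys.getfilesystemencoding() before and after. The variable names are only for illustration.

```python
import locale
import sys

before = sys.getfilesystemencoding()

saved = locale.setlocale(locale.LC_CTYPE)  # remember the current setting
try:
    locale.setlocale(locale.LC_CTYPE, "C")  # change the locale at runtime
    after = sys.getfilesystemencoding()
finally:
    locale.setlocale(locale.LC_CTYPE, saved)  # put the original locale back

# The file system encoding was fixed at startup and did not follow the locale.
print("before: %s, after: %s" % (before, after))
```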

BTW: I made N attempts to change this value after Python had started, all for nothing. Damn.

What the file system encoding does
What the file system stores is the encoded byte form of the text. For example, a file's path is stored in the file system as UTF-8:
In [34]: os.listdir('/tmp')
Out[34]:
['\xe6\xa2\x81\xe5\xba\x86\xe5\x96\x9c',]

When you use a unicode string object in Python to refer to a resource in the file system, Python encodes it with the file system encoding, for example:
In [41]: a = unicode('/tmp/梁庆喜', 'utf-8')
In [42]: a
Out[42]: u'/tmp/\u6881\u5e86\u559c'
In [43]: os.path.exists(a)
Out[43]: True

Python then treats the unicode object the same way as the byte string below:
In [36]: a = '/tmp/梁庆喜'
In [37]: a
Out[37]: '/tmp/\xe6\xa2\x81\xe5\xba\x86\xe5\x96\x9c'
In [38]: os.path.exists(a)
Out[38]: True

If the argument to os.path.exists is unicode, it is encoded with the file system encoding, and the system API is then called with the result.
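
The conversion described above can be sketched as a small function. to_fs_bytes is a hypothetical helper written for this note, not something Python provides under that name; falling back to the default encoding when no file system encoding is reported is an assumption of the sketch.

```python
# -*- coding: utf-8 -*-
import sys


def to_fs_bytes(path, fs_encoding=None):
    """Encode a unicode path with the file system encoding; pass bytes through."""
    # to_fs_bytes is a made-up helper that imitates the step described above.
    if fs_encoding is None:
        # Assumption: fall back to the default encoding if none is reported.
        fs_encoding = sys.getfilesystemencoding() or sys.getdefaultencoding()
    if isinstance(path, bytes):
        return path  # already in the form the system API expects
    return path.encode(fs_encoding)


# With a UTF-8 file system encoding, the unicode path becomes the same
# bytes that os.listdir('/tmp') showed earlier.
print(repr(to_fs_bytes(u'/tmp/\u6881\u5e86\u559c', 'utf-8')))
```

Passing 'ascii' as fs_encoding for the same path raises UnicodeEncodeError, which is the failure seen under mod_wsgi. Python 3 ships a comparable function, os.fsencode.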

What the default encoding does
The following example shows the issue:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> a = '梁庆喜'
>>> unicode(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

When the unicode function is not given a second argument, it decodes using the default encoding.
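
Naming the encoding explicitly avoids any dependence on the default encoding. A minimal sketch, with try_decode as a hypothetical helper; data.decode(encoding) is used here because it does the same job as passing a second argument to unicode():

```python
# -*- coding: utf-8 -*-
raw = b'\xe6\xa2\x81\xe5\xba\x86\xe5\x96\x9c'  # UTF-8 bytes of the name used above


def try_decode(data, encoding):
    """Return (text, None) on success or (None, exception) on failure."""
    # try_decode is a made-up helper for illustration.
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


text_ascii, err_ascii = try_decode(raw, 'ascii')
text_utf8, err_utf8 = try_decode(raw, 'utf-8')

# 'ascii' fails on the very first byte (0xe6); 'utf-8' decodes all three characters.
print("ascii failed at position %d" % err_ascii.start)
print(repr(text_utf8))
```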

When Python is started the usual way, it removes the setdefaultencoding method from sys.
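
Whether the method is still present in a running interpreter can be checked directly. A tiny sketch; note that on Python 3 the function does not exist at all, so the check prints False there for a different reason.

```python
import sys

# True only if the interpreter still exposes sys.setdefaultencoding,
# for example Python 2 started in a way that skips the usual site setup.
has_setter = hasattr(sys, "setdefaultencoding")
print("sys.setdefaultencoding available: %s" % has_setter)
```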

We can override this encoding by starting Python with the -S option:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'
>>> a = '梁庆喜'
>>> unicode(a)
u'\u6881\u5e86\u559c'

If you do not want to pass -S at startup, you can also modify the setencoding function in /usr/lib/python2.6/site.py.
