Python C i18n - liubingzhq - ChinaUnix blog

Python C i18n

Category: LINUX

2008-12-17 17:01:28

First of all, i18n is shorthand for internationalisation (there are
18 letters between the i and the n). The same reasoning is behind
l10n (localisation).

The standard translation support on Linux is "gettext".
It consists of a translations database stored in
the filesystem, utilities to manage the database
and an API (which comes with glibc) to access it.

Database:

    The translations database is stored in separate files like:

        $dirname/$locale/$category/$domain.mo

    an example of the variables being:

        dirname=/usr/share/locale    #This is the usual location
        locale=en_IE                 #language_COUNTRY
        category=LC_MESSAGES         #strings in your app
        domain=fslint                #your app
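
As a rough illustration of the layout above, the following Python 3 sketch assembles the catalog path from the four variables. The helper name catalog_path is made up for this example; gettext.find is the standard library function that does the real lookup, and it returns None when no matching catalog is installed.

```python
import gettext
import os


def catalog_path(dirname, locale_name, category, domain):
    """Build $dirname/$locale/$category/$domain.mo (hypothetical helper)."""
    return os.path.join(dirname, locale_name, category, domain + ".mo")


path = catalog_path("/usr/share/locale", "en_IE", "LC_MESSAGES", "fslint")
print(path)

# gettext.find() searches the same layout and returns None if nothing matches.
found = gettext.find("fslint", "/usr/share/locale", languages=["en_IE"])
print(found)
```

Whether the second print shows a path or None depends on what is installed on the machine.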

API (to set variables above in your program):

    C:

        #include <libintl.h>  /* bindtextdomain(), textdomain(), gettext() */
        #include <locale.h>   /* setlocale() */
        bindtextdomain("fslint","/usr/share/locale");
        setlocale(LC_ALL,""); /* set all locale categories from the LC_ALL, LC_* or LANG environment variables */
        /* note gettext uses LC_MESSAGES category */
        textdomain("fslint");

    Python:

        import gettext, locale
        gettext.bindtextdomain("fslint", "/usr/share/locale") #sys default used if localedir=None
        locale.setlocale(locale.LC_ALL,'')
        gettext.textdomain("fslint")

        #Note if you initially do the following, it is much
        #faster as the lookup for the mo file is not done for each translation
        #(the C version automatically caches the translations so it's not needed there).
        #The unicode argument exists in Python 2 only; it was removed in Python 3.
        gettext.install("fslint",localedir=None,unicode=1) #None means the system default locale directory

        #Note also before python 2.3 you need the following if
        #you need translations from non python code (glibc,libglade etc.)
        gtk.glade.bindtextdomain("fslint",textdomain) #there are other access points to this function

        #Note python parses the translations itself, instead of letting
        #glibc do it. This is for platform independence I suppose, but
        #it does allow you to use python to display existing message catalogs:
        $ LANG=es python
        >>> import gettext
        >>> gettext.install("libc")
        >>> for item in gettext._translations['/usr/share/locale/es/LC_MESSAGES/libc.mo']._catalog.keys():
        ...     print item, ":",  gettext._translations['/usr/share/locale/es/LC_MESSAGES/libc.mo']._catalog[item]
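
The session above is Python 2. A rough Python 3 equivalent might look like the sketch below. It relies on _catalog, which is a private attribute of the standard library's GNUTranslations class and not a documented API, so treat it as an assumption that may break between versions. With fallback=True the call does not fail when no Spanish libc catalog is installed.

```python
import gettext

# fallback=True returns a NullTranslations object instead of raising
# when no Spanish libc catalog is installed.
translation = gettext.translation("libc", languages=["es"], fallback=True)

# _catalog is a private attribute of GNUTranslations (an assumption about
# CPython internals); NullTranslations has no catalog, hence the default.
catalog = getattr(translation, "_catalog", {})

# Show only the first few entries to keep the output short.
for msgid in list(catalog)[:5]:
    print(msgid, ":", catalog[msgid])
```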

    To actually call the gettext translation functions,
    just replace your strings "string" with gettext("string").
    The following shortcuts are usually used:

    Python:
        _ = gettext.gettext #Don't do if used gettext.install above (more inefficient)
        print _("translated string")

    C:
        #define _(x) gettext(x)
        printf(_("translated string"));
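
For Python 3, one possible way to set up the _ shortcut is sketched below. It uses gettext.translation with fallback=True, so untranslated strings pass through unchanged when no catalog is found. The empty temporary directory is only there to make the example behave the same on every machine; a real program would pass its actual locale directory.

```python
import gettext
import tempfile

# An empty temporary directory stands in for a locale directory that has
# no catalog, so the fallback path is exercised deterministically.
with tempfile.TemporaryDirectory() as empty_localedir:
    translation = gettext.translation(
        "fslint", localedir=empty_localedir, languages=["es"], fallback=True
    )

_ = translation.gettext  # bind once; the catalog is not looked up per call
message = _("translated string")
print(message)
```

Because no catalog exists here, the message comes back untranslated.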

Utilities:

    The next thing to do is extract the marked strings from your
    source files for translation and insertion into the database. Python used to
    have its own utility (pygettext.py) to do this, but the best way
    now is to use the standard xgettext utility which now supports python.
    The output from this stage is a pot file.

    The last thing left to do is the actual translation.
    Translators create a "po" file from the pot file above,
    by just entering the text for the source strings in the pot file.
    Then the developer compiles these to binary mo files for
    use by the application. msgfmt and msgmerge are the main
    utilities for manipulating po, pot and mo files.

    The quickest way to learn about the external utilities
    (xgettext, msgmerge, msgfmt) is to look at existing examples,
    which are usually in po/Makefile in various projects, including: FSlint
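
To make the workflow concrete, here is a small Python 3 sketch that lists the three typical command lines without running them. The file names (fslint.py, the po/ directory, the es language) are hypothetical; the options shown are standard GNU gettext ones, but check the man pages of your installed version before relying on them.

```python
import shutil

DOMAIN = "fslint"        # text domain from the examples above
SOURCES = ["fslint.py"]  # hypothetical source file
LANG = "es"              # hypothetical target language

commands = [
    # 1. extract the marked strings from the sources into a pot template
    ["xgettext", "--language=Python", "--keyword=_",
     "--output=po/%s.pot" % DOMAIN] + SOURCES,
    # 2. merge new template strings into an existing po file
    ["msgmerge", "--update", "po/%s.po" % LANG, "po/%s.pot" % DOMAIN],
    # 3. compile the po file into the binary mo file
    ["msgfmt", "--output-file=po/%s.mo" % LANG, "po/%s.po" % LANG],
]

# Only report what would be run and whether each tool is on the PATH.
for argv in commands:
    available = shutil.which(argv[0]) is not None
    print(" ".join(argv), "(installed)" if available else "(not found)")
```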

Charsets:

   Translators can represent your strings in various ways.
   For example, the Euro symbol (€) can be encoded like:

         A4 in iso-8859-15
       20AC in Unicode (code point U+20AC)
     E282AC in utf-8

   All in all, utf-8 is the best one to use if you can,
   as it involves the least conversion and is very
   efficient for primarily ascii text.

   Note gtk2 only takes utf-8. Note also pygtk will
   auto convert from unicode to utf-8. Python will
   convert translations to unicode if you specify
   unicode=1 to gettext.install(). So, for example,
   if you got translations in each of the 3 encodings
   above, the charset translation process for pygtk
   would be:

   iso-8859-15 \
   unicode      - unicode - utf-8
   utf-8       /
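
The three representations of the Euro symbol can be checked with a few lines of Python 3. This is only a sketch to illustrate the table and the conversion diagram above.

```python
euro = "\u20ac"  # the Euro sign as a Unicode code point

latin9 = euro.encode("iso-8859-15")
utf8 = euro.encode("utf-8")

print(latin9.hex(), "%04X" % ord(euro), utf8.hex())

# Decoding either byte form gives back the same Unicode string,
# which is the middle step of the diagram above.
same = latin9.decode("iso-8859-15") == utf8.decode("utf-8") == euro
print(same)
```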

Misc

   It's not just strings that need to be translated
   in an application. For example, number and date
   representations differ between locales. To handle these
   you need to use variants of the standard functions
   for representing numbers to users:

   C:
       #include <locale.h>
       #include <stdio.h>
       setlocale(LC_ALL, "");
       printf("%'d", 1234); /* notice the ' flag, which requests locale-specific digit grouping */

   Python:
       import locale
       locale.setlocale(locale.LC_ALL, "")
       locale.format("%d", 1234, 1) #this is a little limited as of 2.2.3
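
The locale.format call above is Python 2 era; it was later deprecated and removed in Python 3.12. A Python 3 sketch of the same idea uses locale.format_string. The helper name format_count is made up for this example, and the output depends on the locale of the machine it runs on.

```python
import locale


def format_count(n):
    """Format an integer with the current locale's digit grouping."""
    return locale.format_string("%d", n, grouping=True)


try:
    locale.setlocale(locale.LC_ALL, "")   # adopt the user's locale
except locale.Error:
    locale.setlocale(locale.LC_ALL, "C")  # fall back if it is unavailable

print(format_count(1234))
```

In the plain C locale no grouping separator is defined, so the number is printed without one.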

More info

   info gettext
A rather nice Python example

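Compared with Java, the Chinese-text problem shows up more "fiercely" in Python. "Fiercely" does not mean that it is more serious or harder to solve, only that Python's default error handling for decode and encode is strict, which means it raises an error immediately, whereas Java handles such errors by replacement, so a Java program with a Chinese-text problem prints lots of "??". In addition, Python's default encoding is ASCII, while Java's default encoding follows that of the operating system. On this point I find Java more reasonable: it is friendlier to programmers and reduces the frustration newbies feel at the start, which helps the language spread. But Python has its reasons too. After all, ASCII is the only character set supported on every platform in the world, and a problem remains a problem that will show up sooner or later, so it is better to face it early than to avoid it.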
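
The difference between strict and replace error handling can be seen directly. The sketch below is Python 3, while the article itself describes Python 2; the codec behaviour shown is the same in both.

```python
text = "中文"

# Default error handling is "strict": the failure is raised immediately.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "replace" substitutes "?" for each character that cannot be encoded.
replaced = text.encode("ascii", errors="replace")

print(strict_failed, replaced)
```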
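Having said all that, it is time to describe the symptoms of the Chinese-text problem in Python. Before that, we need to know that Python (2.x, the version this article covers) has two kinds of strings: ordinary strings (each character is 8 bits) and Unicode strings (each character takes one or more bytes), and they can be converted into each other. On Unicode, Joel Spolsky gives a vivid explanation in The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), and Jason Orendorff has written an even more comprehensive description, so I will not say more here. Consider the following code: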

x = u"中文你好"
print x

Running the code above makes Python report the following error:

SyntaxError: Non-ASCII character '\xd6' in file G:\workspace\chinese_problem\src\test.py on line 1, but no encoding declared; see for details

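It says that a non-ASCII character was encountered and refers us to pep-0263. PEP-0263 (PEP stands for Python Enhancement Proposal) states it clearly: Python is aware of the internationalisation problem and proposes a solution. Following the proposal, we get the following code: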

# -*- coding:gb2312 -*- # must be on the first or second line
print "-------------code 1----------------"
a = "中文a我爱你"
print a
print a.find("我")
b = a.replace("爱", "喜欢")
print b
print "--------------code 2----------------"
x = "中文a我爱你"
y = unicode(x, "gb2312")
print y.encode("gb2312")
print y.find(u"我")
z = y.replace(u"爱", u"喜欢")
print z.encode("gb2312")
print "---------------code 3----------------"
print y

The program produces the following output:

-------------code 1----------------
中文a我爱你
5
中文a我喜欢你
--------------code 2----------------
中文a我爱你
3
中文a我喜欢你
---------------code 3----------------
Traceback (most recent call last):
File "G:\Downloads\eclipse\workspace\p\src\hello.py", line 16, in
print y
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

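    We can see that, with the encoding declaration in place, we can use Chinese normally, and in code 1 and code 2 the console prints the Chinese correctly. Clearly, though, the code above also exposes several problems:
    1. code 1 and code 2 use print differently: code 1 prints directly, while code 2 encodes before printing.
    2. code 1 and code 2 search the same string for the same character "我" but get different results (5 and 3 respectively).
    3. code 3 fails when it prints the Unicode string y directly (which is also why code 2 encodes first).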

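    Why? We can mentally walk through how we use Python. First, we write the source code in an editor and save it to a file. If the source has an encoding declaration and the editor supports that syntax, the file is saved to disk in the corresponding encoding. Note that the encoding declaration and the actual encoding of the source file are not necessarily the same: you can perfectly well declare UTF-8 in the encoding declaration but save the file as GB2312. Of course, we would not go looking for trouble by getting it wrong on purpose, and a good IDE can enforce consistency between the two, but with editors such as Notepad or EditPlus this mistake is easy to make by accident.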
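    Once we have a .py file we can run it, which hands the code to the Python parser. When the parser reads the file, it first parses the encoding declaration. Assuming the file's encoding is gb2312, it first converts the file content from gb2312 to Unicode and then converts that Unicode into a UTF-8 byte string. After this step the parser splits the UTF-8 bytes into tokens and parses them. When it meets a Unicode string literal, it creates a Unicode string object from the corresponding UTF-8 bytes; when the program uses an ordinary string literal, the parser converts the UTF-8 bytes, via Unicode, back into a byte string in the declared encoding (gb2312 here) and creates an ordinary string object from it. In other words, Unicode strings and ordinary strings are stored differently in memory: the former hold Unicode code points in the interpreter's internal representation (built from the parser's UTF-8 data), while the latter hold GB2312 bytes.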
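
The decoding steps, and the effect of a declaration that does not match the file, can be imitated with plain codec calls. This Python 3 sketch only models the byte conversions; it does not invoke the parser itself, and the variable names are made up for the example.

```python
source_text = "中文"

# What an editor writes to disk when it saves the file as GB2312.
on_disk = source_text.encode("gb2312")

# Parser steps with a correct declaration: GB2312 bytes -> Unicode -> UTF-8.
as_unicode = on_disk.decode("gb2312")
as_utf8 = as_unicode.encode("utf-8")
print(len(on_disk), len(as_unicode), len(as_utf8))

# A declaration that claims UTF-8 for the same bytes cannot be honoured.
try:
    on_disk.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True
print(mismatch_detected)
```

The first byte on disk is 0xd6, the same byte that the SyntaxError shown earlier complains about.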
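    Now that we know how strings are stored in memory, we need to understand how print works. print merely hands the relevant bytes in memory to the operating system, whose program (for example the cmd window) displays them. There are two cases:
    1. If the string is an ordinary string, print only has to push its bytes to the operating system, as in code 1.
    2. If the string is a Unicode string, it is encoded before being pushed: we can explicitly call the Unicode string's encode method with a suitable encoding (code 2); otherwise Python encodes with its default encoding, which is ASCII (code 3). ASCII cannot encode Chinese, of course, so Python raises an error.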
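
The encode-before-output step can be modelled in Python 3 with a text wrapper around a byte buffer. The function console_bytes is a made-up stand-in for a console with a given encoding; it is a sketch of the idea, not of how print is implemented.

```python
import io


def console_bytes(text, console_encoding):
    """Return the bytes a console with this encoding would receive."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=console_encoding)
    stream.write(text)  # the wrapper encodes the text on the way out
    stream.flush()
    return raw.getvalue()


y = "中文a我爱你"
gb_bytes = console_bytes(y, "gb2312")
print(gb_bytes)

# An ASCII-only output stream fails, like code 3 in the article.
try:
    console_bytes(y, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```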
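    At this point we can explain the first and third problems. As for the second: Python has two kinds of strings, ordinary and Unicode, each with its own string methods. For ordinary strings the methods work on bytes, and in GB2312 each Chinese character takes two bytes, so the result is 5 (中 and 文 take two bytes each and a takes one); for Unicode strings all characters are treated uniformly, so the result is 3.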
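
The two find results can be reproduced in Python 3, where bytes plays the role of the Python 2 ordinary string and str plays the role of the Unicode string. This is a sketch for illustration only.

```python
text = "中文a我爱你"
raw = text.encode("gb2312")  # byte view, like a Python 2 ordinary string

byte_index = raw.find("我".encode("gb2312"))  # counts bytes
char_index = text.find("我")                  # counts characters
print(byte_index, char_index)
```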
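    Although only console programs were discussed above, the Chinese-text problems in file I/O and network transfer are similar in principle. Unicode goes a long way towards solving software internationalisation, and Python supports Unicode very well, so I suggest using Unicode consistently when writing Python programs, and saving files as UTF-8. How to Use UTF-8 with Python describes this in detail and is worth consulting.
    There are many more places in Python where Chinese-text problems can arise, such as file reading and writing or network data transfer. I hope everyone will share experience so that we can solve these problems together.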
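
For file I/O, the advice above could look like the following Python 3 sketch: the program handles only Unicode strings, and an explicit UTF-8 codec is applied at the file boundary. The temporary file is just for the demonstration.

```python
import os
import tempfile

text = "中文a我爱你"

# Write and read through an explicit UTF-8 codec so that bytes exist
# only at the file boundary.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()
    size_on_disk = os.path.getsize(path)
finally:
    os.remove(path)

print(round_trip == text, size_on_disk)
```

The file is 16 bytes long: three bytes for each of the five Chinese characters plus one for the ASCII letter.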
