Category: LINUX
2008-12-17 17:01:28
First of all, i18n is shorthand for internationalisation (i + 18 letters + n).
The same reasoning is behind l10n (localisation).
The standard translation support on Linux is "gettext".
It consists of a translations database stored in
the filesystem, utilities to manage the database
and an API (which comes with glibc) to access it.
Database:
The translations database is stored in separate files like:
$dirname/$locale/$category/$domain.mo
Example values for these variables:
dirname=/usr/share/locale #This is the usual location
locale=en_IE #language_COUNTRY
category=LC_MESSAGES #strings in your app
domain=fslint #your app
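As a small illustration of how the four variables above combine into a catalog path, here is a minimal sketch. It assumes Python 3 (the Python examples in this article were written for Python 2), and the helper name mo_path is hypothetical, not part of gettext.

```python
import posixpath

def mo_path(dirname, locale_name, category, domain):
    """Join the four parts into the path of the compiled catalog file."""
    return posixpath.join(dirname, locale_name, category, domain + ".mo")

# Values taken from the example above.
path = mo_path("/usr/share/locale", "en_IE", "LC_MESSAGES", "fslint")
print(path)
```

For a real lookup, gettext.find("fslint", "/usr/share/locale", ["en_IE"]) searches the same layout and returns None when no catalog is installed.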
API (to set variables above in your program):
C:
#include <libintl.h>
#include <locale.h>
bindtextdomain("fslint","/usr/share/locale");
setlocale(LC_ALL,""); /* set all locale categories to value in LC_ALL or LANG environment variables */
/* note gettext uses LC_MESSAGES category */
textdomain("fslint");
Python:
import gettext, locale
gettext.bindtextdomain("fslint", "/usr/share/locale") #sys default used if localedir=None
locale.setlocale(locale.LC_ALL,'')
gettext.textdomain("fslint")
#Note: if you do the following once at startup, it is much
#faster, as the mo file lookup is not repeated for each translation
#(the C version caches the translations automatically, so it's not needed there).
gettext.install("fslint",localedir=None,unicode=1) #None means the system default locale directory
#Note also before python 2.3 you need the following if
#you need translations from non python code (glibc,libglade etc.)
gtk.glade.bindtextdomain("fslint",textdomain) #there are other access points to this function
#Note python parses the translations itself, instead of letting
#glibc do it. This is for platform independence I suppose, but
#it does allow you to use python to display existing message catalogs:
$ LANG=es python
>>> import gettext
>>> gettext.install("libc")
>>> for item in gettext._translations['/usr/share/locale/es/LC_MESSAGES/libc.mo']._catalog.keys():
...     print item, ":", gettext._translations['/usr/share/locale/es/LC_MESSAGES/libc.mo']._catalog[item]
To actually call the gettext translation functions,
just replace your strings "string" with gettext("string").
The following shortcuts are usually used:
Python:
_ = gettext.gettext #Don't do this if you used gettext.install above (it is less efficient)
print _("translated string")
C:
#define _(x) gettext(x)
printf(_("translated string"));
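For readers on Python 3, where gettext.install() no longer accepts the unicode argument, the same shortcut can be sketched as follows. The domain name "fslint_demo_sketch" is hypothetical and is assumed to have no installed catalog, so the example shows the fallback behaviour rather than a real translation.

```python
import gettext

# "fslint_demo_sketch" is a hypothetical domain with no installed catalog.
# fallback=True makes translation() return a NullTranslations object
# instead of raising an error when no .mo file is found.
translator = gettext.translation("fslint_demo_sketch", localedir=None, fallback=True)
_ = translator.gettext

message = _("translated string")
print(message)  # the untranslated string, since no catalog was found
```

With a real catalog installed for the domain and the current locale, the same call would return the translated text.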
Utilities:
The next thing to do is extract the marked strings from your
source files for translation and insertion into the database. Python used to
have its own utility (pygettext.py) to do this, but the best way
now is to use the standard xgettext utility, which now supports Python.
The output from this stage is a pot file.
The last thing left to do is the actual translation.
Translators create a "po" file from the pot file above,
by just entering the text for the source strings in the pot file.
Then the developer compiles these to binary mo files for
use by the application. msgfmt and msgmerge are the main
utilities for manipulating po, pot and mo files.
The quickest way to learn about the external utilities
(xgettext, msgmerge, msgfmt) is to look at existing examples,
which are usually in po/Makefile in various projects, including FSlint.
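To make the "compile to a binary mo file" step concrete without requiring the GNU tools, here is a rough Python 3 stand-in for what msgfmt produces. It is a sketch, not the real tool: the function name build_mo and the sample Spanish translation are hypothetical, and plurals, contexts and the optional hash table are ignored.

```python
import gettext
import io
import struct

def build_mo(messages):
    """Pack a {msgid: msgstr} dict into the bytes of a little-endian .mo file."""
    # The empty msgid carries the catalog header, which names the charset.
    catalog = {"": "Content-Type: text/plain; charset=UTF-8\n"}
    catalog.update(messages)
    # Entries are stored sorted by msgid.
    items = sorted((k.encode("utf-8"), v.encode("utf-8")) for k, v in catalog.items())
    count = len(items)
    header_size = 7 * 4
    ids = b"".join(key + b"\0" for key, value in items)
    strs = b"".join(value + b"\0" for key, value in items)
    id_off = header_size + 2 * count * 8   # msgids follow the two index tables
    str_off = id_off + len(ids)            # msgstrs follow the msgids
    key_table, val_table = [], []
    for key, value in items:
        key_table += [len(key), id_off]
        val_table += [len(value), str_off]
        id_off += len(key) + 1
        str_off += len(value) + 1
    header = struct.pack("<7I", 0x950412DE, 0, count,
                         header_size, header_size + count * 8, 0, 0)
    tables = struct.pack("<%dI" % (4 * count), *(key_table + val_table))
    return header + tables + ids + strs

mo_bytes = build_mo({"translated string": "cadena traducida"})
translator = gettext.GNUTranslations(io.BytesIO(mo_bytes))
print(translator.gettext("translated string"))
```

In a real project the po file would be compiled with msgfmt and installed under the locale directory; the in-memory catalog here only shows that the application side reads exactly this kind of file.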
Charsets:
Translators can represent your strings in various ways.
For example, the Euro symbol (€) can be encoded as:
A4 in iso-8859-15
20AC in unicode
E282AC in utf-8
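The three values above can be checked with a short sketch, assuming Python 3, where str holds Unicode code points and bytes holds encoded data:

```python
euro = "\u20ac"  # the Euro sign, code point U+20AC

latin9_hex = euro.encode("iso-8859-15").hex().upper()
codepoint_hex = "%04X" % ord(euro)
utf8_hex = euro.encode("utf-8").hex().upper()

print("iso-8859-15:", latin9_hex)
print("unicode    :", codepoint_hex)
print("utf-8      :", utf8_hex)
```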
All in all, UTF-8 is the best one to use if you can,
as it involves the least conversion and is very
efficient for primarily ASCII text.
Note that gtk2 only takes UTF-8. Note also that pygtk will
automatically convert from unicode to UTF-8. Python will
convert translations to unicode if you specify
unicode=1 to gettext.install(). So, for example,
if you got translations in each of the 3 encodings
above, the charset translation process for pygtk
would be:
iso-8859-15 \
    unicode  - unicode - utf-8
      utf-8 /
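The conversion path in the diagram can be sketched in Python 3 as follows. The helper to_utf8 is hypothetical; it only mirrors the two steps (decode to Unicode, then encode to UTF-8) and is not what pygtk or gettext run internally.

```python
def to_utf8(text, charset=None):
    """Decode bytes in `charset` to a Unicode string, then encode it as UTF-8."""
    if isinstance(text, bytes):
        text = text.decode(charset)   # step 1: bytes -> unicode
    return text.encode("utf-8")       # step 2: unicode -> utf-8

# The Euro sign arriving in each of the three forms.
from_latin9 = to_utf8(b"\xa4", "iso-8859-15")
from_unicode = to_utf8("\u20ac")
from_utf8 = to_utf8(b"\xe2\x82\xac", "utf-8")

print(from_latin9 == from_unicode == from_utf8)
```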
Misc
It's not just strings that need to be translated
in an application. For example, number and date
representations differ between locales. To handle these
you need to use variants of the standard functions
for representing numbers to users:
C:
#include <locale.h>
#include <stdio.h>
setlocale(LC_ALL, "");
printf("%'d", 1234); /* notice the ' */
Python:
import locale
locale.setlocale(locale.LC_ALL, "")
locale.format("%d", 1234, 1) #this is a little limited as of 2.2.3
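On current Python 3 releases locale.format() has been removed (as of 3.12), and locale.format_string() is the equivalent call. A minimal sketch, assuming Python 3; the grouping that is printed depends on the locale configured in the environment, so no fixed output is shown:

```python
import locale

try:
    # Adopt the user's locale from the environment (LC_ALL, LANG, ...).
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    pass  # the environment names an unsupported locale; keep the default

# grouping=True inserts the locale's thousands separator, if it defines one.
formatted = locale.format_string("%d", 1234, grouping=True)
print(formatted)
```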
More info
info gettext
A rather good Python example
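Compared with Java, problems with Chinese text show up more sharply in Python. "Sharply" does not mean
they are more serious or harder to solve. It is just that Python's default handling of decode and encode
errors is strict, that is, it raises an error immediately, whereas Java handles them by replacement, so when
Java hits a problem with Chinese text it prints a lot of "??". In addition, Python's default encoding is
ASCII, while Java's default encoding matches the operating system's encoding. On this point I think Java is
more reasonable: it is friendlier to programmers, reduces the frustration newbies feel at the start, and
helps the language spread. But Python has its reasons too. After all, ASCII is the only character set
supported on every platform in the world, and a problem is still a problem: it will show up sooner or later,
so it is better to face it early than to dodge it.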
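The difference between strict and replace error handling can be sketched as follows. This assumes Python 3, where the default encoding is UTF-8 rather than ASCII, so the ASCII codec is named explicitly to reproduce the behaviour described for Python 2.

```python
raw = "我".encode("utf-8")  # three bytes: E6 88 91

# Strict handling (the default): decoding non-ASCII bytes raises an error.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Replace handling: each undecodable byte becomes U+FFFD instead of failing.
replaced = raw.decode("ascii", errors="replace")

print(strict_failed, len(replaced))
```

The replace mode never fails, but the original text is lost, which is the trade-off the paragraph above describes.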
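Well, having said all that, it is time to describe the symptoms of Chinese text problems in Python. Before that, we need to know that Python has two kinds of strings: ordinary strings (each character represented by 8
bits) and Unicode strings (each character represented by one or more bytes), and the two can be converted to each other. On Unicode, Joel Spolsky gives a vivid explanation in The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), and Jason Orendorff gives a more comprehensive description, so I will not say more here. Consider the following code: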
Running the code above, Python gives the following error message:
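We can see that once an encoding declaration is introduced, Chinese can be used normally in the source, and in code 1 and 2 the console also prints the Chinese text correctly. Clearly, though, the code above also reveals quite a few problems:
1. Code 1 and 2 use print differently: 1 prints directly, while 2 encodes before printing.
2. Code 1 and 2 search the same string for the same character "我" and get different results (5 and 3 respectively).
3. In code 3, printing the unicode string y directly raises an error (which is also why code 2 encodes before printing).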
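The byte versus character difference behind point 2 can be illustrated with a minimal sketch. It assumes Python 3, where bytes plays the role of the ordinary string and str the role of the Unicode string. The sample string is hypothetical, chosen so that the two indexes come out as 5 and 3; it is not the string from code 1 and 2.

```python
text = "ab中我"               # hypothetical sample: 4 characters
data = text.encode("utf-8")   # the same text as 8 bytes (1 + 1 + 3 + 3)

# Searching the Unicode string counts characters.
char_index = text.find("我")
# Searching the encoded bytes counts bytes, so the index is larger.
byte_index = data.find("我".encode("utf-8"))

print(char_index, byte_index)
```

Encoding explicitly before output, as code 2 does, corresponds to calling text.encode(...) with an encoding the console is known to accept.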