Handling Chinese text with UTF-8 mode - snowtty - ChinaUnix blog

Handling Chinese text with UTF-8 mode

Category: LINUX

2008-10-01 09:37:39

I realized I had never actually written notes on this, and since it just came up while chatting with PipperL on his blog, here is a brief explanation.

--------------------------------------------------------------------------------

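Problem description

Why run perl in UTF-8 mode? Because Perl strings are byte strings by default. That makes no difference to people who only use ASCII, but it is a real nuisance for CJK users. For example, the regular expression in the following program fails to match the character 「英」: the second byte of its big5 code is the '^' character, which the regex engine reads as an anchor instead of as part of the character, so the pattern can never match: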

SHELL> more -x4 plain.pl
#!/usr/bin/perl -w
# Source encoded in big5.
my $s = '英雄人物';
if ($s =~ m/英/o) {
    print "是我\n";      # "it's me"
}
else {
    print "不是我\n";    # "not me"
}

SHELL> ./plain.pl
不是我

So we are forced to write the awkward:

SHELL> echo '英' | hexdump -C
00000000  ad 5e 0a                                          |.^.|
00000003

SHELL> more -x4 hacked.pl
#!/usr/bin/perl -w
# Source encoded in big5.

my $s = '英雄人物';
if ($s =~ m/\xAD\x5E/o) {    # the two big5 bytes of 英
    print "是我\n";      # "it's me"
}
else {
    print "不是我\n";    # "not me"
}

If a perl program could use Unicode in its regular expressions, this problem would go away.

Solution

Use perl 5.8.6 or later, and put this at the very top of the program:

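use utf8;

This tells perl that the source file is encoded in UTF-8, so every string literal in the program is treated as a string of characters instead of raw bytes. Where needed, use bytes inside a specific block switches back to byte strings. With that, the program above works properly: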

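SHELL> more -x4 u8mode.pl
#!/usr/bin/perl -w
# Source encoded in utf8.
use utf8;
my $s = '英雄人物';
if ($s =~ m/英/o) {
    print "是我\n";      # "it's me"
}
else {
    print "不是我\n";    # "not me"
}

SHELL> ./u8mode.pl
Wide character in print at ./u8mode.pl line 6.
是我

However, use utf8 only covers the source text. Perl's I/O handles still have no encoding layer, so printing a character string produces the message "Wide character in print at ./u8mode.pl line 6.". Besides, what is read in or written out is usually big5, not utf8. We therefore add these lines to tell perl which encoding the outside world uses: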

binmode(STDIN, ':encoding(big5)');
binmode(STDOUT, ':encoding(big5)');
binmode(STDERR, ':encoding(big5)');

Do the same for any files you open yourself. The final version looks like this:

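SHELL> more -x4 u8mode.pl
#!/usr/bin/perl -w
# Source encoded in utf8.
use utf8;
binmode(STDIN, ':encoding(big5)');
binmode(STDOUT, ':encoding(big5)');
binmode(STDERR, ':encoding(big5)');
my $s = '英雄人物';
if ($s =~ m/英/o) {
    print "是我\n";      # "it's me"
}
else {
    print "不是我\n";    # "not me"
}

SHELL> ./u8mode.pl
是我

One more thing to watch out for: use utf8 only affects the language mechanisms that perl itself provides, and does not necessarily affect 3rd party libraries. So when using DBI together with use utf8, remember that whatever fetchrow_xxxx() returns has to be handled according to the encoding of the original source, using _utf8_on() or _utf8_off() to set the string's utf8 flag directly, so that the data is neither converted twice nor left unconverted. For details see perldoc Encode; I won't go into more here.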
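
As a rough illustration of the point about third-party data, the sketch below treats a hard-coded byte string as if it had come back from a DBI fetch (an assumption; no database is involved) and converts it with Encode's decode(). Note that this uses decode() in place of _utf8_on(): _utf8_on() only flips the flag and fits data whose bytes are already valid UTF-8, whereas data stored in big5 needs an actual conversion. The variable names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a value returned by something like
# fetchrow_array(): the raw big5 bytes of 英 (0xAD 0x5E), not characters.
my $raw     = "\xAD\x5E";
my $raw_len = length $raw;

# decode() turns the bytes into a character string; the first argument is
# the encoding the data source actually uses (assumed to be big5 here).
my $text = Encode::decode('big5', $raw);

# The decoded string is a single character, U+82F1 (英), so a Unicode
# pattern now matches it without any byte-level tricks.
my $matched = ($text =~ /\x{82F1}/) ? 1 : 0;

# Encode again on the way out, e.g. before handing the value to code
# that expects big5 bytes.
my $bytes_out = Encode::encode('big5', $text);

printf "bytes in: %d, characters after decode: %d, matched: %d\n",
    $raw_len, length($text), $matched;
```

If the database already returns UTF-8 bytes, the same pattern applies with 'UTF-8' as the encoding name.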

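Addendum, 2007-02-27:

I looked into what the Traditional and Simplified Chinese encodings are called in perl.

According to the documentation for Encode::TW and Encode::CN, for the big5 and gb2312 encodings we can use the names big5 and gb2312 in perl to handle the charsets usually met in web pages and mail. However, to accommodate what most people (Windows users) actually use, cp950 and cp936 may cause less trouble.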
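
To show how one of these names is used with a file you open yourself, here is a minimal sketch that writes two characters through a cp950 layer and reads them back. The temporary file is a hypothetical stand-in for a Windows-produced text file, and the characters are written as escapes so the sketch does not depend on the encoding of its own source.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# "\x{82F1}\x{96C4}" is 英雄, spelled with code point escapes.
my $text = "\x{82F1}\x{96C4}";

# Hypothetical scratch file; it is removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write: the :encoding(cp950) layer converts characters to cp950 bytes.
open(my $out, '>:encoding(cp950)', $path) or die "open $path: $!";
print {$out} $text;
close $out or die "close $path: $!";

# Size on disk in bytes (each of these characters takes two bytes).
my $size = -s $path;

# Read back through the same layer to get characters again.
open(my $in, '<:encoding(cp950)', $path) or die "open $path: $!";
my $round_trip = do { local $/; <$in> };
close $in;

print "bytes on disk: $size\n";
print "round trip ok\n" if $round_trip eq $text;
```

Swapping cp950 for big5, cp936 or gb2312 in both open() calls is all it takes to try the other names mentioned above.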

Here is what Encode::TW says:

Since the original "big5" encoding (1984) is not supported anywhere (glibc and DOS-based systems uses "big5" to mean "big5-eten"; Microsoft uses "big5" to mean "cp950"), a conscious decision was made to alias "big5" to "big5-eten", which is the de facto superset of the original big5.

Generally speaking, the big5 we use on UNIX is big5-eten, while Windows uses cp950. cp950 stands for Code Page 950; compared with big5-eten, its best-known addition is the euro sign.
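
One way to check these names on your own installation is to ask Encode directly. The sketch below is only a probe, not a statement of what every version reports: it resolves each name to its canonical encoding, then tries to encode the euro sign (U+20AC) strictly in big5-eten and cp950. The hash names are made up for the example, and the euro result depends on the mapping tables shipped with the installed Encode.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Ask Encode which canonical encoding each name maps to.
my %canonical;
for my $name (qw(big5 cp950 gb2312 cp936)) {
    $canonical{$name} = Encode::resolve_alias($name) || '(unknown)';
    print "$name -> $canonical{$name}\n";
}

# Try to encode the euro sign with a strict check: under FB_CROAK,
# encode() dies if the target encoding has no mapping for a character.
my %has_euro;
for my $name (qw(big5-eten cp950)) {
    my $euro = "\x{20AC}";    # fresh copy: FB_CROAK may modify its input
    my $ok = eval { Encode::encode($name, $euro, Encode::FB_CROAK()); 1 };
    $has_euro{$name} = $ok ? 1 : 0;
    print "$name can encode the euro sign: ",
        ($has_euro{$name} ? 'yes' : 'no'), "\n";
}
```

In line with the Encode::TW note quoted above, the first loop should report big5 as big5-eten.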
