UTF8 下中文处理 [转载]-coolend-ChinaUnix博客

404&nbsp;NOT&nbsp;FOUNDmuddyboot.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

coolend

博客访问： 599831
博文数量： 40
博客积分： 7274
博客等级：少将
技术积分： 410
用户组：普通用户
注册时间： 2005-12-20 15:00

个人简介

Expired

文章分类

全部博文（40）

未分配的博文（40）

文章存档

2011年（1）

2008年（3）

2007年（17）

2006年（10）

2005年（9）

我的朋友

最近访客

推荐博文

UTF8 下中文处理 [转载]

分类：

2007-01-21 11:31:49

原文地址:

先看一个例子：

先建一个文件 test.txt，写上“中文”两个字。运行下面一个程序：

use strict;
use warnings;
use Encode;
my $str_code = "中文";
my $file = "test.txt";

open(my $fh, $file) || die "Can't open file $file: $!";
chomp(my $str_file = <$fh>);
close $fh;

open(my $fh_utf8, "<:utf8", $file) || die "Can't open file $file: $!";
chomp(my $str_fileu = <$fh_utf8>);
close $fh_utf8;

test_utf8($str_code, "code");
test_utf8($str_file, "file");
test_utf8($str_fileu, "file(utf8 layer)");
# print $str_code, "\n";
# print $str_file, "\n";
# print $str_fileu, "\n";

Encode::_utf8_on($str_code);
Encode::_utf8_on($str_file);
test_utf8($str_code, "code");
test_utf8($str_file, "file");

sub test_utf8 {
    my ($str, $type) = @_;
    if ( Encode::is_utf8($str) ) {
        print "String from $type, utf8 flag is on."
    } else {
        print "String from $type, utf8 flag is off."
    }
    my @chars = split('', $str);
    # print join("\t", @chars), "length: ", scalar(@chars);
    if (scalar(@chars)==2) {
        print " Can match chinese character.";
    } else {
        print " Can't match chinese character.";
    }
    print "\n";
}

结果如下：

String from code, utf8 flag is off. Can't match chinese character.
String from file, utf8 flag is off. Can't match chinese character.
String from file(utf8 layer), utf8 flag is on. Can match chinese character.
Turn on utf flag:
String from code, utf8 flag is on. Can match chinese character.
String from file, utf8 flag is on. Can match chinese character.

所以如果要想使用汉字作为一个字符的特性，就要在 open 里指明 io layer 为 utf8，同样，输出指明 utf8，然后在脚本里写 use utf8。这样你就不用担心 utf8 下的乱码，又能享受汉字做为一个字符来写正则表达式的的畅快。

如果还不知道我上面说的意思，再给一个例子。还是新建一个文件 test.txt，写上：

求
汉
汁
汗
汕
江

运行下面的脚本：

use strict;
use warnings;
use Encode;
use utf8;

my $file = "test.txt";
my $outfile = "test-out.txt";
# open(my $out, '>:utf8', $outfile) || die "Can't open file $outfile: $!";
# select $out;
# open(my $fh, $file) || die "Can't open file $file: $!";
open(my $fh, '<:utf8', $file) || die "Can't open file $file: $!";
binmode STDOUT, ":utf8";
while (<$fh>) {
    if (/[汉]/) {
        unless (Encode::is_utf8($_)) {
            Encode::_utf8_on($_);
        }
        print $_;
    }
}

这个例子中，有好几个组合：

不使用 utf8 指令，open 中也不使用 utf8。这时所有行都匹配。因为正则表达式中是三个字节（我假定你也用 utf8 来编码代码，如果不是，可以不匹配任何行），而所有输入的行中都含有这三个字节中的一个，事实上 test.txt 是我精心挑选的，前两个字节都相同的汉字。

不使用 utf8 指令, open 中使用 utf8。不能匹配任何行。因为输入文件中是汉字，而正则表达式是三个字节。
使用 utf8 指令，open 中不使用 utf8。也不能匹配任何行。因为正则表达式中是一个汉字，而输入文件是一个个的字节。
使用 utf8 指令，open 中也使用 utf8。正确匹配一行。
有时候在你没有指明 utf8 作 io layer 时，可能会看到一个 warning:
```
Wide character in print at script.pl line 32,  line 1.
```
这个 warning 确实是一个小小的 warning，不会影响结果。如果不想有这个 warning，最好的办法是用 binmode 或者在 open 中指定使用 utf8 layer。或者对于 warning 的字符串用 Encode 模块中 _utf8_on 函数，强制加上 utf8 flag。

阅读(1521) | 评论(0) | 转发(0) |

上一篇：OpenLDAP+FreeRADIUS+MySQL+RP-PPPOE 构建PPPOE服务器

下一篇：解决处理 UTF-8 编码XML文件时，输出和写入乱码问题

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6