中文二元切词
保留中文单字
英文单词
去除其他符号
忽略字母数小于三个的英文单词
英文统一转成小写
调用时传入的字符串应该是 utf8 flag 没有打开(未 decode )的状态。
调用示例:
- use Hukaa::Search;
- my @results = cut_words("中华人民共和国 hel-HHHlo呵呵[]哈哈wor\\ld下What不为例啊");
- print join("\n", @results), "\n";
结果:
源代码:
- package Hukaa::Search;
- use strict;
- use warnings;
- use Encode qw(decode_utf8 encode_utf8);
- use base qw(Exporter);
- our @EXPORT_OK = qw(cut_words);
- our @EXPORT = @EXPORT_OK;
- sub cut_words {
- my $str = shift;
- $str = decode_utf8($str);
- my @words;
- my $buf;
- my $state = 0; # 0 - out, 1 - in chinese, 2 - in english
- my $two_chinese = 0;
- for my $i (0 .. length($str)) {
- my $char = substr($str, $i, 1);
- #print $char, "\t";
- if (ord($char) > 255) { # chinese
- if ($state == 0) {
- $buf = $char;
- } elsif ($state == 1) {
- $buf .= $char;
- push @words, encode_utf8($buf);
- $two_chinese = 1;
- $buf = $char;
- } elsif ($state == 2) {
- push @words, lc($buf) if length($buf) > 2;
- $buf = $char;
- }
- $state = 1;
- } elsif ($char =~ /\w/) { # english word
- if ($state == 0) {
- $buf = $char;
- } elsif ($state == 1) {
- if ($two_chinese == 0) {
- push @words, encode_utf8($buf);
- } else {
- $two_chinese = 0;
- }
- $buf = $char;
- } elsif ($state == 2) {
- $buf .= $char;
- }
- $state = 2;
- } else { # separaters
- if ($state == 0) {
- } elsif ($state == 1) {
- if ($two_chinese == 0) {
- push @words, encode_utf8($buf);
- } else {
- $two_chinese = 0;
- }
- $buf = "";
- } elsif ($state == 2) {
- push @words, lc($buf) if length($buf) > 2;
- $buf = "";
- }
- $state = 0;
- }
- }
- return @words;
- }
- 1;
阅读(380) | 评论(0) | 转发(0) |