The most common encodings for Chinese text are GB2312, GBK, GB18030, Big5 (Taiwan), and Unicode, and I ran into many, many problems with them while processing the Sogou Internet corpus. The first problem was word segmentation; thanks to a guy in my laboratory, that one is solved. However, his tool only handles GBK, so it suits the Sogou corpus, while I usually work under the zh_CN.UTF-8 locale on my openSUSE box. Besides, I found that the computing station's locale is UTF-8 too, so I had to convert the corpus from GBK to UTF-8. Then the nightmare began... At first I used iconv under Linux for the job, but it choked on a lot of illegal characters. Then I wrote a small C tool to filter out characters beyond the GBK encoding range, but some illegal characters still remained, and I was stuck.
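(For the record, iconv's `-c` flag, as in `iconv -c -f GBK -t UTF-8`, discards unconvertible characters instead of aborting.) The same filtering idea as my small C tool can be sketched in a few lines of Python; the `sample` bytes here are a made-up stand-in for a corpus line, not real Sogou data:

```python
# Sketch: convert GBK bytes to UTF-8, dropping byte sequences
# that are not valid GBK (what the C filter tool was meant to do).
sample = "中文文本".encode("gbk") + b"\xff\xff"  # valid GBK plus two illegal bytes

# errors="ignore" silently drops anything the GBK decoder rejects;
# errors="replace" would insert U+FFFD instead of dropping.
text = sample.decode("gbk", errors="ignore")
utf8_bytes = text.encode("utf-8")
print(text)  # → 中文文本
```

The trade-off is the same as with iconv: `ignore` loses data silently, so it is worth counting how many bytes get dropped before trusting the output.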
After a long silence, I decided to simply drop the lines containing illegal characters, using head and tail to locate them. In the end I dropped about 20,000 lines. Finally, it works!
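Locating bad lines with head and tail can be automated: a whole line is dropped whenever it fails to decode as GBK. A minimal sketch (the function name is mine, not from any tool I actually used):

```python
def filter_gbk_lines(lines):
    """Keep only lines that decode cleanly as GBK.

    Returns (decoded_lines, dropped_count), mirroring the manual
    "drop lines with illegal characters" step.
    """
    kept, dropped = [], 0
    for raw in lines:
        try:
            kept.append(raw.decode("gbk"))
        except UnicodeDecodeError:
            dropped += 1
    return kept, dropped

# Usage over a file (file names are hypothetical):
#   with open("sogou_corpus.txt", "rb") as f:
#       kept, dropped = filter_gbk_lines(f)
#   with open("sogou_corpus.utf8.txt", "w", encoding="utf-8") as out:
#       out.writelines(kept)
```

Reporting `dropped` makes it easy to sanity-check that you are throwing away roughly the expected 20,000 lines and not half the corpus.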
The next step is to train the language model and design an index structure so it can be accessed quickly.
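Assuming a word-level n-gram model over the segmented UTF-8 corpus (the post does not say which model is planned, so this is only a guess), the raw statistic behind it is just bigram counting:

```python
from collections import Counter

def bigram_counts(sentences):
    """Count bigrams over pre-segmented sentences (lists of words).

    These counts are the raw input to n-gram language model
    estimation; smoothing and the index structure come later.
    """
    counts = Counter()
    for words in sentences:
        for a, b in zip(words, words[1:]):
            counts[(a, b)] += 1
    return counts
```

How these counts are stored (sorted arrays, tries, hash tables) is exactly the "index structure for quick access" question.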