2008-04-24 16:19:56

The most common encodings for Chinese text are GB2312, GBK, GB18030, Big5 (Taiwan), and Unicode, so I ran into a great many problems while processing the Sogou Internet corpus. The first problem was word segmentation; thanks to a colleague in my laboratory, that one is solved. However, his tool handles GBK only, which suits the Sogou corpus. I usually work under the zh_CN.UTF-8 locale on my OpenSuSE machine, and I found that the computing station's locale is UTF-8 too, so I had to convert the corpus from GBK to UTF-8. Then the nightmare began... At first I used iconv under Linux for the job, but it ran into a lot of illegal characters. Then I wrote a small C tool to filter out characters outside the GBK encoding range, but some illegal characters still remained, and I was confused...
After a long pause, I decided to drop the lines that contain illegal characters. I used head and tail to locate these lines, and ended up dropping about 20000 of them. Finally, it works!
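
For anyone who wants to repeat the "drop the bad lines" step without hunting for them by hand, here is a rough Python sketch. It is not the tool I used (that was iconv plus head and tail); the function name convert_gbk_to_utf8 and the file paths are made up for illustration, and it assumes one record per line.

```python
def convert_gbk_to_utf8(src_path, dst_path, encoding="gbk"):
    """Re-encode src_path as UTF-8, skipping lines that fail to decode.

    Returns the 1-based line numbers of the dropped lines.
    Hypothetical helper for illustration, not the original tool.
    """
    dropped = []
    # Read bytes, not text: the byte 0x0A never occurs inside a GBK
    # two-byte character, so splitting on newlines is safe.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for lineno, raw in enumerate(src, start=1):
            try:
                text = raw.decode(encoding)
            except UnicodeDecodeError:
                # The line holds at least one sequence the codec rejects.
                dropped.append(lineno)
                continue
            dst.write(text.encode("utf-8"))
    return dropped
```

Calling convert_gbk_to_utf8("corpus.gbk.txt", "corpus.utf8.txt") would write the clean lines and return the numbers of the lines it skipped, so the loss can be counted. Since GB18030 is a superset of GBK, passing encoding="gb18030" might rescue some of the lines, though I have not checked that against the Sogou data.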

The next step is to train the language model and design an index structure for fast access to it.
 