The most common encodings for Chinese text are GB2312, GBK, GB18030, Big5 (Taiwan), and Unicode, and I ran into many, many problems with them while processing the Sogou Internet corpus. The first problem was word segmentation; thanks to a guy in my laboratory, that one is solved. However, his tool only handles GBK, so it suits the Sogou corpus, while I usually work under the zh_CN.UTF-8 locale on my openSUSE box. Besides, I found that the computing station's locale is UTF-8 too, so I had to convert the corpus from GBK to UTF-8. Then the nightmare began... At first I used iconv under Linux for the job, but it choked on a lot of illegal characters. Then I wrote a small C tool to filter out characters beyond the GBK encoding range, but some illegal characters still remained, and I was stuck.
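(For the record, iconv's `-c` flag, as in `iconv -c -f GBK -t UTF-8`, discards unconvertible characters instead of aborting.) The same filtering idea as my small C tool can be sketched in a few lines of Python; the `sample` bytes here are a made-up stand-in for a corpus line, not real Sogou data:

```python
# Sketch: convert GBK bytes to UTF-8, dropping byte sequences
# that are not valid GBK (what the C filter tool was meant to do).
sample = "中文文本".encode("gbk") + b"\xff\xff"  # valid GBK plus two illegal bytes

# errors="ignore" silently drops anything the GBK decoder rejects;
# errors="replace" would insert U+FFFD instead of dropping.
text = sample.decode("gbk", errors="ignore")
utf8_bytes = text.encode("utf-8")
print(text)  # → 中文文本
```

The trade-off is the same as with iconv: `ignore` loses data silently, so it is worth counting how many bytes get dropped before trusting the output.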
After a long silence, I decided to simply drop the lines containing illegal characters, using head and tail to locate them. In the end I dropped about 20,000 lines. Finally, it works!
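Locating bad lines with head and tail can be automated: a whole line is dropped whenever it fails to decode as GBK. A minimal sketch (the function name is mine, not from any tool I actually used):

```python
def filter_gbk_lines(lines):
    """Keep only lines that decode cleanly as GBK.

    Returns (decoded_lines, dropped_count), mirroring the manual
    "drop lines with illegal characters" step.
    """
    kept, dropped = [], 0
    for raw in lines:
        try:
            kept.append(raw.decode("gbk"))
        except UnicodeDecodeError:
            dropped += 1
    return kept, dropped

# Usage over a file (file names are hypothetical):
#   with open("sogou_corpus.txt", "rb") as f:
#       kept, dropped = filter_gbk_lines(f)
#   with open("sogou_corpus.utf8.txt", "w", encoding="utf-8") as out:
#       out.writelines(kept)
```

Reporting `dropped` makes it easy to sanity-check that you are throwing away roughly the expected 20,000 lines and not half the corpus.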
The next step is to train the language model and design an index structure so it can be accessed quickly.
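Assuming a word-level n-gram model over the segmented UTF-8 corpus (the post does not say which model is planned, so this is only a guess), the raw statistic behind it is just bigram counting:

```python
from collections import Counter

def bigram_counts(sentences):
    """Count bigrams over pre-segmented sentences (lists of words).

    These counts are the raw input to n-gram language model
    estimation; smoothing and the index structure come later.
    """
    counts = Counter()
    for words in sentences:
        for a, b in zip(words, words[1:]):
            counts[(a, b)] += 1
    return counts
```

How these counts are stored (sorted arrays, tries, hash tables) is exactly the "index structure for quick access" question.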