[python] Converting Sogou cell dictionaries (.scel) to mmseg dictionaries - flynetcn - ChinaUnix blog

[python] Converting Sogou cell dictionaries to mmseg dictionaries

Category: Python/Ruby

2014-08-11 15:56:05

From:
------------------------------------------------------------

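Converting Sogou cell dictionaries to mmseg dictionaries

Features:

scel2mmseg.py: converts a .scel file into a .txt file in mmseg format

Usage: python scel2mmseg.py a.scel a.txt

Batch conversion: python scel2mmseg.py <directory of .scel files> a.txt

Note: every newly added word gets a frequency of 1. The format is explained as follows: [excerpted from ]

Each record takes two lines. The first line is the term, in the format [word]\t[frequency]. For a single character, the value is the frequency with which that character forms a word on its own; it has to be counted over a large pre-segmented corpus, and users generally do not need to change it when adding or removing words. For multi-character words, the frequency must be 1. The second line is a placeholder: LibMMSeg's code was adapted from another Coreseek segmentation library (an N-gram model), where the second line held the word's frequency distribution across parts of speech. LibMMSeg users can simply put "x:1" on the second line.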
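
The two-line record layout described above can be sketched in a few lines of Python. This is only an illustration of the format as quoted; the helper name `format_mmseg_entry` is hypothetical and is not part of scel2mmseg.py.

```python
def format_mmseg_entry(word, freq=1):
    """Return the two-line mmseg record for one word.

    Line 1 is "<word>\t<frequency>"; line 2 is the "x:1" placeholder.
    Multi-character words should keep the default frequency of 1.
    """
    return "%s\t%d\nx:1\n" % (word, freq)


# Example words chosen only for illustration.
entries = ["搜狗", "细胞词库"]
text = "".join(format_mmseg_entry(w) for w in entries)
print(text, end="")
```

Each word added this way contributes exactly two lines to the output file, which is also what the `store` function in the script does.
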
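mergedict.py: merges multiple mmseg .txt files into one .txt file

Usage: python mergedict.py unigram.txt b.txt c.txt new.txt

Note: each input .txt can be in mmseg format or in one-word-per-line format (in which case the frequency defaults to 1)

Caution: merging checks for duplicates, so a word that has already appeared in an earlier file is not appended to the new file. For that reason unigram.txt must come first.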
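
mergedict.py itself is not listed in this post. The following is a minimal sketch of the merge rule described above (first occurrence wins, plain word lists default to frequency 1), not the actual script. The function names are hypothetical, and it assumes that any line starting with "x:" is a placeholder line rather than a word.

```python
def parse_dict_lines(lines):
    """Yield (word, freq) pairs from mmseg-format or one-word-per-line text."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("x:"):
            continue  # skip blank lines and the "x:1" placeholder line
        word, _, freq = line.partition("\t")
        yield word, (freq or "1")  # no tab means a plain word list: default to 1


def merge_dicts(sources):
    """Merge several dictionaries, keeping the first occurrence of each word."""
    seen = set()
    merged = []
    for lines in sources:
        for word, freq in parse_dict_lines(lines):
            if word not in seen:
                seen.add(word)
                merged.append((word, freq))
    return merged


# Made-up sample data: an mmseg-format file followed by a plain word list.
unigram = ["的\t100\n", "x:1\n", "搜狗\t1\n", "x:1\n"]
extra = ["搜狗\n", "词库\n"]
merged = merge_dicts([unigram, extra])
for word, freq in merged:
    print("%s\t%s\nx:1" % (word, freq))
```

Because earlier sources take priority, passing unigram.txt first preserves its corpus-derived frequencies for single characters.
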

------------------------------------------------------------

scel2mmseg.py:
------------------------------------------------------------

import glob
import os
import struct
import sys


def read_utf16_str(f, offset=-1, length=2):
    """Read `length` bytes (from `offset` if given) and decode them as UTF-16LE."""
    if offset >= 0:
        f.seek(offset)
    return f.read(length).decode('UTF-16LE')


def read_uint16(f):
    """Read one little-endian unsigned 16-bit integer."""
    return struct.unpack('<H', f.read(2))[0]


def get_word_from_sogou_cell_dict(fname):
    """Yield (pinyin, word) pairs from a Sogou .scel cell dictionary."""
    file_size = os.path.getsize(fname)
    with open(fname, 'rb') as f:
        # Byte 4 of the header determines where the word table starts.
        mask = f.read(128)[4]
        if mask == 0x44:
            hz_offset = 0x2628
        elif mask == 0x45:
            hz_offset = 0x26c4
        else:
            sys.exit('unsupported .scel header: %s' % fname)

        # Fixed-size metadata fields; read for reference, not used below.
        title = read_utf16_str(f, 0x130, 0x338 - 0x130)
        category = read_utf16_str(f, 0x338, 0x540 - 0x338)
        desc = read_utf16_str(f, 0x540, 0xd40 - 0x540)
        samples = read_utf16_str(f, 0xd40, 0x1540 - 0xd40)

        # Pinyin table: maps a numeric code to a syllable; 'zuo' is the last entry.
        py_map = {}
        f.seek(0x1540 + 4)
        while True:
            py_code = read_uint16(f)
            py_len = read_uint16(f)
            py_str = read_utf16_str(f, -1, py_len)
            if py_code not in py_map:
                py_map[py_code] = py_str
            if py_str == 'zuo':
                break

        # Word table: each group of words shares one pinyin string.
        f.seek(hz_offset)
        while f.tell() != file_size:
            word_count = read_uint16(f)
            pinyin_count = read_uint16(f) // 2  # stored as a byte length
            py_set = [py_map[read_uint16(f)] for _ in range(pinyin_count)]
            py_str = "'".join(py_set)
            for _ in range(word_count):
                word_len = read_uint16(f)
                word_str = read_utf16_str(f, -1, word_len)
                f.read(12)  # skip 12 bytes of per-word data that is not needed
                yield py_str, word_str


def showtxt(records):
    """Debug helper: print each word with its length."""
    for _, word in records:
        print(len(word), word)


def store(records, f):
    """Write each word as a two-line mmseg record with frequency 1."""
    for _, word in records:
        f.write("%s\t1\n" % word)
        f.write("x:1\n")


def main():
    if len(sys.argv) != 3:
        print("Unknown Option \n usage: python %s file.scel new.txt" % sys.argv[0])
        sys.exit(1)
    # If the first argument is a directory, convert every .scel file in it
    # and combine the results into one txt file.
    if os.path.isdir(sys.argv[1]):
        for file_name in glob.glob(os.path.join(sys.argv[1], '*.scel')):
            print(file_name)
            generator = get_word_from_sogou_cell_dict(file_name)
            with open(sys.argv[2], "a", encoding="utf8") as f:
                store(generator, f)
    else:
        generator = get_word_from_sogou_cell_dict(sys.argv[1])
        with open(sys.argv[2], "w", encoding="utf8") as f:
            store(generator, f)
        # showtxt(generator)


if __name__ == "__main__":
    main()

------------------------------------------------------------
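
To make the word-table layout that the script walks through easier to follow, here is a small self-contained sketch. It packs one synthetic record into memory and parses it back using the same field order the script reads: word count, pinyin byte length, pinyin codes, then for each word a byte length, the UTF-16LE text, and 12 skipped bytes. The helpers `pack_record` and `parse_record` are hypothetical, the pinyin codes are made up, and the 12 skipped bytes are filled with zeros purely as padding.

```python
import io
import struct


def pack_record(py_ids, words):
    """Build one synthetic word-table record (illustration only)."""
    out = struct.pack('<HH', len(words), len(py_ids) * 2)
    out += struct.pack('<%dH' % len(py_ids), *py_ids)
    for w in words:
        raw = w.encode('UTF-16LE')
        out += struct.pack('<H', len(raw)) + raw + b'\x00' * 12
    return out


def parse_record(f, py_map):
    """Parse one record in the same field order as scel2mmseg.py."""
    word_count, py_bytes = struct.unpack('<HH', f.read(4))
    py_ids = struct.unpack('<%dH' % (py_bytes // 2), f.read(py_bytes))
    py_str = "'".join(py_map[i] for i in py_ids)
    words = []
    for _ in range(word_count):
        (word_len,) = struct.unpack('<H', f.read(2))
        words.append(f.read(word_len).decode('UTF-16LE'))
        f.read(12)  # skipped, as in the script
    return py_str, words


py_map = {0: 'sou', 1: 'gou'}  # made-up codes, not the real pinyin table
buf = io.BytesIO(pack_record([0, 1], ['搜狗']))
result = parse_record(buf, py_map)
print(result)
```

Round-tripping a synthetic record like this is a cheap way to check the field order and byte widths without needing a real .scel file at hand.
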
