From:
------------------------------------------------------------
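Convert Sogou (搜狗) cell dictionaries to mmseg dictionaries
Features:
-
scel2mmseg.py: converts a .scel file into an mmseg-format .txt file
Usage: python scel2mmseg.py a.scel a.txt
Batch conversion: python scel2mmseg.py scel_directory a.txt
Note: every newly added word gets a frequency of 1. The format is explained as follows: [excerpted from ]
Each record takes two lines. The first line is the term, in the format [term]\t[frequency]. For a single character, the frequency is how often that character
stands alone as a word; it has to be computed from a large pre-segmented corpus, and users normally do not need to change it when adding or removing words.
For multi-character words, the frequency must be 1. The second line is a placeholder: the LibMMSeg code was adapted from another Coreseek segmentation
library (an N-gram model), where the second line held the word's frequency distribution across parts of speech. LibMMSeg users can simply put "x:1" on the second line.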
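As a rough illustration of the two-line record layout described above, here is a minimal Python 3 sketch. The helper name format_record is hypothetical; it is not part of LibMMSeg or of the scripts in this post.

```python
def format_record(word, freq=1):
    # Hypothetical helper, for illustration only.
    # Line 1 is "<word>\t<freq>"; line 2 is the "x:1" placeholder.
    # Per the format description, multi-character words must use frequency 1.
    if len(word) > 1 and freq != 1:
        raise ValueError("multi-character words must have frequency 1")
    return "%s\t%d\nx:1\n" % (word, freq)
```

Calling format_record("搜狗") gives the term line followed by the placeholder line, which is the same shape that scel2mmseg.py writes for every converted word.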
-
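mergedict.py: merges several mmseg .txt files into one .txt file
Usage: python mergedict.py unigram.txt b.txt c.txt new.txt
Note: each .txt file can be in mmseg format or contain one word per line (in which case the frequency defaults to 1)
Caution: the merge removes duplicates, so a word that already appeared in an earlier file is not appended to the new file. unigram.txt must therefore come first.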
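The source of mergedict.py is not shown in this post, so the following is only a sketch of the behavior described above, not the actual script. The function names parse_entries and merge_dicts are hypothetical, and the sketch assumes that placeholder lines always start with "x:".

```python
def parse_entries(lines):
    # Yield (word, freq) pairs from mmseg-format or one-word-per-line text.
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: skip blank lines and the "x:1" placeholder lines.
        if not line or line.startswith("x:"):
            continue
        word, sep, freq = line.partition("\t")
        # A line without a tab is a bare word, so its frequency defaults to 1.
        yield word, (freq if sep else "1")


def merge_dicts(sources):
    # Merge several iterables of lines; the first occurrence of a word wins.
    seen = set()
    merged = []
    for lines in sources:
        for word, freq in parse_entries(lines):
            if word in seen:
                continue
            seen.add(word)
            merged.append("%s\t%s\nx:1\n" % (word, freq))
    return merged
```

Because the first occurrence wins, passing the unigram.txt lines first keeps its corpus-derived single-character frequencies instead of the default of 1 from a later file.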
------------------------------------------------------------
scel2mmseg.py:
------------------------------------------------------------
import glob
import os
import struct
import sys


def read_utf16_str(f, offset=-1, length=2):
    # Read `length` bytes (after seeking to `offset` if it is >= 0)
    # and decode them as UTF-16LE.
    if offset >= 0:
        f.seek(offset)
    data = f.read(length)
    return data.decode('UTF-16LE')


def read_uint16(f):
    # Read one little-endian unsigned 16-bit integer.
    return struct.unpack('<H', f.read(2))[0]


def get_word_from_sogou_cell_dict(fname):
    # Generator yielding (pinyin, word) pairs from one .scel file.
    f = open(fname, 'rb')
    file_size = os.path.getsize(fname)

    # Byte 4 of the header selects where the word table starts.
    hz_offset = 0
    mask = f.read(128)[4]
    if mask == 0x44:
        hz_offset = 0x2628
    elif mask == 0x45:
        hz_offset = 0x26c4
    else:
        sys.exit(1)

    # Fixed-offset metadata fields; read here but not used below.
    title = read_utf16_str(f, 0x130, 0x338 - 0x130)
    dict_type = read_utf16_str(f, 0x338, 0x540 - 0x338)
    desc = read_utf16_str(f, 0x540, 0xd40 - 0x540)
    samples = read_utf16_str(f, 0xd40, 0x1540 - 0xd40)

    # Pinyin table: maps a numeric code to a pinyin syllable.
    py_map = {}
    f.seek(0x1540 + 4)

    while True:
        py_code = read_uint16(f)
        py_len = read_uint16(f)
        py_str = read_utf16_str(f, -1, py_len)

        if py_code not in py_map:
            py_map[py_code] = py_str

        # 'zuo' is the last entry of the table.
        if py_str == 'zuo':
            break

    # Word table: each group shares one pinyin string.
    f.seek(hz_offset)
    while f.tell() != file_size:
        word_count = read_uint16(f)
        pinyin_count = read_uint16(f) // 2

        py_set = []
        for i in range(pinyin_count):
            py_id = read_uint16(f)
            py_set.append(py_map[py_id])
        py_str = "'".join(py_set)

        for i in range(word_count):
            word_len = read_uint16(f)
            word_str = read_utf16_str(f, -1, word_len)
            f.read(12)  # skip the 12 bytes that follow each word
            yield py_str, word_str

    f.close()


def showtxt(records):
    for (pystr, word) in records:
        print(len(word), word)


def store(records, f):
    # Write each word as a two-line mmseg record with frequency 1.
    for (pystr, word) in records:
        f.write("%s\t1\n" % word)
        f.write("x:1\n")


def main():
    if len(sys.argv) != 3:
        print("Unknown Option \n usage: python %s file.scel new.txt" % (sys.argv[0]))
        sys.exit(1)

    # If the first argument is a directory, convert every .scel file in it
    # and combine the results into one .txt file.
    if os.path.isdir(sys.argv[1]):
        for fileName in glob.glob(os.path.join(sys.argv[1], '*.scel')):
            print(fileName)
            generator = get_word_from_sogou_cell_dict(fileName)
            with open(sys.argv[2], "a", encoding="utf-8") as f:
                store(generator, f)

    else:
        generator = get_word_from_sogou_cell_dict(sys.argv[1])
        with open(sys.argv[2], "w", encoding="utf-8") as f:
            store(generator, f)
        #showtxt(generator)


if __name__ == "__main__":
    main()
------------------------------------------------------------
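The pinyin table loop in scel2mmseg.py reads each entry as a 16-bit code, a 16-bit byte length, and a UTF-16LE string. The sketch below shows that layout on synthetic in-memory bytes rather than a real .scel file. read_pinyin_entry and pack_pinyin_entry are hypothetical names used only for this illustration, and the little-endian byte order is an assumption that matches the UTF-16LE decoding used in the script.

```python
import io
import struct


def read_uint16(f):
    # One little-endian unsigned 16-bit integer.
    return struct.unpack("<H", f.read(2))[0]


def read_pinyin_entry(f):
    # Entry layout: uint16 code, uint16 byte length, UTF-16LE text.
    code = read_uint16(f)
    byte_len = read_uint16(f)
    return code, f.read(byte_len).decode("utf-16-le")


def pack_pinyin_entry(code, pinyin):
    # Hypothetical inverse, used here only to build synthetic test bytes.
    raw = pinyin.encode("utf-16-le")
    return struct.pack("<HH", code, len(raw)) + raw


# Two synthetic entries, read back in order.
buf = io.BytesIO(pack_pinyin_entry(0, "a") + pack_pinyin_entry(1, "ai"))
entries = [read_pinyin_entry(buf), read_pinyin_entry(buf)]
```

Note that the length field counts bytes, not characters, so "ai" is stored with a length of 4.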