Recently I finished a pinyin-to-hanzi converter that uses a trigram language model and the Viterbi decoding algorithm. It is written in C and Perl; the Perl part parses the raw corpus collected from the web. Compressing the language model (LM) is difficult, so for now I simply cut off the trigrams whose count is less than 3.
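To make the decoding step concrete, here is a minimal sketch of Viterbi search over a trigram model. It is written in Python for brevity and is not the actual C code. The names `LEXICON`, `TRIGRAM_LOGP`, `UNIGRAM_LOGP`, `BACKOFF_PENALTY`, `FLOOR`, `logp` and `viterbi` are all made up for illustration, the probabilities are toy values, and the back-off is a crude fixed penalty rather than a properly normalized one.

```python
import math

# Toy data, all hypothetical: candidate hanzi for each pinyin syllable.
LEXICON = {
    "zhong": ["中", "种"],
    "guo": ["国", "过"],
    "ren": ["人", "认"],
}
# A few trigram log-probabilities, log P(w | u, v); "<s>" pads the start.
TRIGRAM_LOGP = {
    ("<s>", "<s>", "中"): math.log(0.6),
    ("<s>", "中", "国"): math.log(0.8),
    ("中", "国", "人"): math.log(0.7),
}
UNIGRAM_LOGP = {
    "中": math.log(0.05), "种": math.log(0.03),
    "国": math.log(0.04), "过": math.log(0.03),
    "人": math.log(0.05), "认": math.log(0.02),
}
BACKOFF_PENALTY = math.log(0.1)  # assumed fixed back-off weight
FLOOR = math.log(1e-6)           # assumed score for unseen characters


def logp(u, v, w):
    """log P(w | u, v), falling back to a penalized unigram score."""
    if (u, v, w) in TRIGRAM_LOGP:
        return TRIGRAM_LOGP[(u, v, w)]
    return BACKOFF_PENALTY + UNIGRAM_LOGP.get(w, FLOOR)


def viterbi(syllables):
    """Return the best hanzi string for a list of pinyin syllables.

    A state is the pair (previous hanzi, current hanzi), which is what a
    trigram model needs in order to score the next character.
    """
    best = {("<s>", "<s>"): 0.0}  # state -> best log-score so far
    history = []                  # one backpointer table per syllable
    for syl in syllables:
        new_best, back = {}, {}
        for (u, v), score in best.items():
            for w in LEXICON[syl]:  # raises KeyError on unknown syllables
                cand = score + logp(u, v, w)
                state = (v, w)
                if state not in new_best or cand > new_best[state]:
                    new_best[state] = cand
                    back[state] = (u, v)
        history.append(back)
        best = new_best
    # Pick the best final state and follow the backpointers.
    state = max(best, key=best.get)
    chars = []
    for back in reversed(history):
        chars.append(state[1])
        state = back[state]
    return "".join(reversed(chars))


print(viterbi(["zhong", "guo", "ren"]))  # 中国人
```

With n candidates per syllable there are at most n² states per position, so each step costs about n³ score lookups instead of enumerating every full sentence.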
The next step is to test the system and to try some entropy-based pruning of the back-off model so as to compress the LM. There is still a lot of hard work to be done.
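As a rough illustration of the difference between the count cutoff and a pruning criterion that looks at the back-off estimate, here is a small Python sketch. It uses the simpler "weighted difference" score (count times the gap between the trigram log-probability and the backed-off log-probability), which is related to, but not the same as, full relative-entropy pruning: it ignores the change in back-off weights, and it uses raw counts where a real implementation would normally use discounted counts. The function names, the input layout and the threshold are assumptions for this example only.

```python
import math


def count_cutoff(trigram_counts, min_count=3):
    """Keep only the trigrams seen at least min_count times."""
    return {g: c for g, c in trigram_counts.items() if c >= min_count}


def weighted_difference(count, logp_trigram, logp_backoff):
    """How much log-likelihood is lost if this trigram is dropped and the
    model backs off instead (simplified: back-off weights held fixed)."""
    return count * (logp_trigram - logp_backoff)


def prune_by_weighted_difference(trigrams, threshold):
    """trigrams maps gram -> (count, logp_trigram, logp_backoff)."""
    return {
        g: v for g, v in trigrams.items()
        if weighted_difference(*v) >= threshold
    }


# Toy numbers, all hypothetical.
trigrams = {
    # Frequent, but the back-off estimate is almost as good: prunable.
    ("中", "国", "的"): (5, math.log(0.20), math.log(0.19)),
    # Rare, but the back-off estimate is far off: worth keeping.
    ("中", "国", "人"): (3, math.log(0.50), math.log(0.01)),
}
kept = prune_by_weighted_difference(trigrams, threshold=1.0)
print(sorted(kept))
```

The point of the comparison is that a plain cutoff at 3 keeps both toy trigrams, while the weighted score drops the one the back-off model already predicts well.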