Category: Java

2013-03-22 23:49:08

An Analyzer defines a policy for extracting index terms from text.
First you implement a Tokenizer, which breaks the character stream of the input into raw tokens.
A TokenStream (discussed in more detail later) is the stream of tokens produced by analysis: you can repeatedly ask it for the next token. For performance, an existing TokenStream can be reused within the same thread instead of allocating a new one each time; this is the purpose of reusableTokenStream.
One or more TokenFilters can then normalize those raw tokens into the final analysis result.
To define the analyzer's behavior, subclasses must build their TokenStreamComponents in the createComponents(String, Reader) method; the components are then reused on each call to tokenStream(String, Reader).
Simple example:

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new FooTokenizer(reader);
        TokenStream filter = new FooFilter(source);
        filter = new BarFilter(filter);
        return new TokenStreamComponents(source, filter);
    }
};
For more examples, see the Analysis package documentation.

For some concrete implementations bundled with Lucene, look in the analysis modules:

  • Common: Analyzers for indexing content in different languages and domains.
  • ICU: Exposes functionality from ICU to Apache Lucene.
  • Kuromoji: Morphological analyzer for Japanese text.
  • Morfologik: Dictionary-driven lemmatization for the Polish language.
  • Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
  • Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
  • Stempel: Algorithmic Stemmer for the Polish Language.
  • UIMA: Analysis integration with Apache UIMA.

To implement a Lucene Analyzer, you must implement at least one Tokenizer. For a given language, the necessary TokenFilters are also indispensable. There are many kinds of filters, mainly used to normalize the tokenizer's output: removing stop words, case folding, English stemming and lemmatization, and so on. The previous two sections cover the details; here is one concrete Analyzer subclass.
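The Tokenizer-plus-TokenFilter chain described above is essentially the decorator pattern. As a plain-Java sketch of the idea (these are illustrative classes made up for this post, not the Lucene API):

```java
import java.util.*;

// Plain-Java analogy of a Lucene analysis chain: a tokenizer produces raw
// tokens, and each filter wraps the previous stage and normalizes its output.
// Illustrative types only, not Lucene classes.
interface SimpleTokenStream {
    String next(); // returns null when the stream is exhausted
}

class SimpleWhitespaceTokenizer implements SimpleTokenStream {
    private final Iterator<String> it;
    SimpleWhitespaceTokenizer(String text) {
        it = Arrays.asList(text.trim().split("\\s+")).iterator();
    }
    public String next() { return it.hasNext() ? it.next() : null; }
}

class SimpleLowerCaseFilter implements SimpleTokenStream {
    private final SimpleTokenStream in;
    SimpleLowerCaseFilter(SimpleTokenStream in) { this.in = in; }
    public String next() {
        String t = in.next();
        return t == null ? null : t.toLowerCase(Locale.ROOT);
    }
}

class SimpleStopFilter implements SimpleTokenStream {
    private final SimpleTokenStream in;
    private final Set<String> stop;
    SimpleStopFilter(SimpleTokenStream in, Set<String> stop) {
        this.in = in;
        this.stop = stop;
    }
    public String next() {
        // Skip any token that appears in the stop set.
        for (String t = in.next(); t != null; t = in.next()) {
            if (!stop.contains(t)) return t;
        }
        return null;
    }
}
```

Chaining `new SimpleStopFilter(new SimpleLowerCaseFilter(new SimpleWhitespaceTokenizer("The Quick Fox")), stop)` with stop = {"the"} yields "quick" then "fox", mirroring how createComponents wires a source Tokenizer into a filter chain.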



import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class FMMAnalyzer extends Analyzer {

    public static DictManager dictManager;

    static {
        initDict();
    }

    /**
     * Some common English words that are not usually useful for searching,
     * plus some double-byte punctuation.
     */
    private static final String[] STOP_WORDS = { "a", "and", "are", "as", "at", "be",
            "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of",
            "on", "or", "s", "such", "t", "that", "the", "their", "then", "there",
            "these", "they", "this", "to", "was", "will", "with", "", "www" };

    // ~ Instance fields --------------------------------------------------------

    /** stop word set */
    private final CharArraySet stopSet;

    /**
     * Builds an analyzer which removes the words in STOP_WORDS.
     */
    public FMMAnalyzer() {
        this(STOP_WORDS);
    }

    /**
     * Builds an analyzer which removes the words in the provided array.
     *
     * @param stopWords stop word array
     */
    public FMMAnalyzer(String[] stopWords) {
        stopSet = StopFilter.makeStopSet(Version.LUCENE_42, stopWords);
    }

    public static void initDict() {
        if (dictManager == null) {
            dictManager = new DictManager();
        }
    }

    // ~ Methods ----------------------------------------------------------------

    /**
     * Build the token stream components for the given field.
     *
     * @param fieldName lucene field name
     * @param reader    input reader
     */
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        final Tokenizer source = new DefaultTokenizer(Version.LUCENE_42, reader);
        // In Lucene 4.2, StandardFilter takes a Version argument.
        TokenStream filter = new StandardFilter(Version.LUCENE_42, source);
        // Apply the stop set built in the constructor (the original listing
        // built stopSet but never used it).
        filter = new StopFilter(Version.LUCENE_42, filter, stopSet);
        // Do NOT close the Tokenizer here; Lucene manages the stream lifecycle.
        return new TokenStreamComponents(source, filter);
    }
}
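The name FMMAnalyzer suggests forward maximum matching (正向最大匹配) segmentation. The DefaultTokenizer source is not shown in this post, so the following is only a hypothetical standalone sketch of that strategy: at each position, take the longest dictionary word that matches, falling back to a single character.

```java
import java.util.*;

// Forward-maximum-matching segmentation sketch. Assumption: this is what the
// "FMM" in FMMAnalyzer refers to; DefaultTokenizer's actual logic is not shown.
public class FmmSegmenter {
    private final Set<String> dict;
    private final int maxLen;

    public FmmSegmenter(Set<String> dict) {
        this.dict = dict;
        int m = 1;
        for (String w : dict) m = Math.max(m, w.length());
        this.maxLen = m; // longest dictionary entry bounds the match window
    }

    public List<String> segment(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            // Shrink the window until it matches a dictionary word,
            // or falls back to a single character.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }
}
```

With a dictionary containing "欢迎" and "北京", segmenting "欢迎来北京" produces ["欢迎", "来", "北京"]: the unknown character "来" falls back to a single-character token.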

The static initializer first calls initDict() to create the segmentation dictionary manager:



import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/**
 * Segmentation dictionary manager.
 *
 * @author lvzf 305036301@qq.com
 */
public class DictManager {
    /**
     * Locations of the dictionary files.
     */
    private String _dictFiles = "dict/snouse.txt,dict/snumbers.txt,dict/splace.txt,dict/susername.txt,dict/sword.txt,dict/sconj.txt";

    /**
     * Ordinary words.
     */
    private PhraseDict wordMap;

    /**
     * Place names.
     */
    private PhraseDict placeMap;

    /**
     * Person names.
     */
    private PhraseDict usernameMap;

    /**
     * Useless (stop) words.
     */
    private PhraseDict nouseMap;

    /**
     * Numbers.
     */
    private PhraseDict numberMap;

    /**
     * Conjunctions.
     */
    private PhraseDict conjMap;

    /**
     * Logger.
     */
    private static final Log log = LogFactory.getLog(DictManager.class);

    public DictManager() {
        // Route each dictionary file to its map by a keyword in its file name.
        String[] files = getDictFiles().split(",");
        for (String file : files) {
            if (file.indexOf("word") > -1) {
                wordMap = parseDict(ClassLoaderUtil.getStream(file));
            } else if (file.indexOf("place") > -1) {
                placeMap = parseDict(ClassLoaderUtil.getStream(file));
            } else if (file.indexOf("username") > -1) {
                usernameMap = parseDict(ClassLoaderUtil.getStream(file));
            } else if (file.indexOf("nouse") > -1) {
                nouseMap = parseDict(ClassLoaderUtil.getStream(file));
            } else if (file.indexOf("number") > -1) {
                numberMap = parseDict(ClassLoaderUtil.getStream(file));
            } else if (file.indexOf("conj") > -1) {
                conjMap = parseDict(ClassLoaderUtil.getStream(file));
            }
        }
    }

    /**
     * Parse one dictionary file and store its entries in a PhraseDict.
     */
    private PhraseDict parseDict(InputStream is) {
        if (is == null) {
            return null;
        }
        PhraseDict dictMap = new PhraseDict();

        InputStreamReader fr = null;
        BufferedReader br = null;
        try {
            fr = new InputStreamReader(is, "UTF-8");
            br = new BufferedReader(fr);
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith("#") || line.equals("")) {
                    continue; // skip comments and blank lines
                }
                // Index each entry under its first character:
                // e.g. "欢迎" is stored in the bucket keyed by '欢'.
                dictMap.put(line.charAt(0), line);
            }
        } catch (IOException e) {
            log.error("Failed to read dictionary", e);
        } finally {
            // Close the outermost stream first, the raw InputStream last.
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    if (log.isDebugEnabled()) {
                        log.debug("Exception while closing BufferedReader", e);
                    }
                }
            }
            if (fr != null) {
                try {
                    fr.close();
                } catch (IOException e) {
                    if (log.isDebugEnabled()) {
                        log.debug("Exception while closing InputStreamReader", e);
                    }
                }
            }
            if (is != null) {
                try {
                    is.close();
                } catch (IOException e) {
                    if (log.isDebugEnabled()) {
                        log.debug("Exception while closing InputStream", e);
                    }
                }
            }
        }

        return dictMap;
    }

    /**
     * @return the dictFiles.
     */
    public final String getDictFiles() {
        return _dictFiles;
    }

    /**
     * @param dictFiles the dictFiles to set.
     */
    public final void setDictFiles(String dictFiles) {
        _dictFiles = dictFiles;
    }

    /**
     * @return the nouseMap.
     */
    public final PhraseDict getNouseMap() {
        return nouseMap;
    }

    /**
     * @return the numberMap.
     */
    public final PhraseDict getNumberMap() {
        return numberMap;
    }

    /**
     * @return the placeMap.
     */
    public final PhraseDict getPlaceMap() {
        return placeMap;
    }

    /**
     * @return the usernameMap.
     */
    public final PhraseDict getUsernameMap() {
        return usernameMap;
    }

    /**
     * @return the wordMap.
     */
    public final PhraseDict getWordMap() {
        return wordMap;
    }

    /**
     * @return the conjMap.
     */
    public final PhraseDict getConjMap() {
        return conjMap;
    }
}
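PhraseDict's implementation is not shown in this post. From dictMap.put(line.charAt(0), line), it appears to bucket entries by their first character, so that segmentation only needs to scan candidates sharing the current character. A hypothetical sketch of that structure:

```java
import java.util.*;

// Hypothetical sketch of the PhraseDict idea (its real source is not shown):
// entries are bucketed by their first character, so lookup only scans words
// that start with that character.
public class PhraseDictSketch {
    private final Map<Character, List<String>> buckets = new HashMap<>();

    public void put(char first, String word) {
        buckets.computeIfAbsent(first, k -> new ArrayList<>()).add(word);
    }

    /** All dictionary entries starting with the given character. */
    public List<String> candidates(char first) {
        return buckets.getOrDefault(first, Collections.emptyList());
    }
}
```

During segmentation, a tokenizer positioned at some character c would call candidates(c) and try to extend the match using only that short list, rather than probing the whole dictionary.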






