Category: Java

2013-03-22 23:49:08

An Analyzer defines a policy for extracting index terms from text.
First you implement a Tokenizer, which breaks the character stream of the input into raw tokens.
A TokenStream (discussed in more detail later) is the stream of tokens produced by analysis: you can repeatedly ask it for the next token. For performance, an existing TokenStream can be reused within the same thread instead of allocating a new one each time; this is the purpose of reusableTokenStream.
One or more TokenFilters can then normalize those raw tokens into the final analysis result.
To define the analyzer's behavior, subclasses must build their TokenStreamComponents in the createComponents(String, Reader) method; the components are then reused on each call to tokenStream(String, Reader).
Simple example:

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new FooTokenizer(reader);
        TokenStream filter = new FooFilter(source);
        filter = new BarFilter(filter);
        return new TokenStreamComponents(source, filter);
    }
};
For more examples, see the Analysis package documentation.

For some concrete implementations bundled with Lucene, look in the analysis modules:

  • Common: Analyzers for indexing content in different languages and domains.
  • ICU: Exposes functionality from ICU to Apache Lucene.
  • Kuromoji: Morphological analyzer for Japanese text.
  • Morfologik: Dictionary-driven lemmatization for the Polish language.
  • Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
  • Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
  • Stempel: Algorithmic Stemmer for the Polish Language.
  • UIMA: Analysis integration with Apache UIMA.

To implement a Lucene Analyzer, you must implement at least one Tokenizer. For a given language, the necessary TokenFilters are also indispensable. There are many kinds of filters, mainly used to normalize the tokenizer's output: removing stop words, case folding, English stemming and lemmatization, and so on. The previous two sections cover the details; here is one concrete Analyzer subclass.
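The Tokenizer-plus-TokenFilter chain described above is essentially the decorator pattern. As a plain-Java sketch of the idea (these are illustrative classes made up for this post, not the Lucene API):

```java
import java.util.*;

// Plain-Java analogy of a Lucene analysis chain: a tokenizer produces raw
// tokens, and each filter wraps the previous stage and normalizes its output.
// Illustrative types only, not Lucene classes.
interface SimpleTokenStream {
    String next(); // returns null when the stream is exhausted
}

class SimpleWhitespaceTokenizer implements SimpleTokenStream {
    private final Iterator<String> it;
    SimpleWhitespaceTokenizer(String text) {
        it = Arrays.asList(text.trim().split("\\s+")).iterator();
    }
    public String next() { return it.hasNext() ? it.next() : null; }
}

class SimpleLowerCaseFilter implements SimpleTokenStream {
    private final SimpleTokenStream in;
    SimpleLowerCaseFilter(SimpleTokenStream in) { this.in = in; }
    public String next() {
        String t = in.next();
        return t == null ? null : t.toLowerCase(Locale.ROOT);
    }
}

class SimpleStopFilter implements SimpleTokenStream {
    private final SimpleTokenStream in;
    private final Set<String> stop;
    SimpleStopFilter(SimpleTokenStream in, Set<String> stop) {
        this.in = in;
        this.stop = stop;
    }
    public String next() {
        // Skip any token that appears in the stop set.
        for (String t = in.next(); t != null; t = in.next()) {
            if (!stop.contains(t)) return t;
        }
        return null;
    }
}
```

Chaining `new SimpleStopFilter(new SimpleLowerCaseFilter(new SimpleWhitespaceTokenizer("The Quick Fox")), stop)` with stop = {"the"} yields "quick" then "fox", mirroring how createComponents wires a source Tokenizer into a filter chain.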



import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class FMMAnalyzer extends Analyzer {

    public static DictManager dictManager;

    static {
        initDict();
    }

    /**
     * Some common English words that are not usually useful for searching,
     * plus some double-byte punctuation.
     */
    private static final String[] STOP_WORDS = { "a", "and", "are", "as", "at", "be",
            "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of",
            "on", "or", "s", "such", "t", "that", "the", "their", "then", "there",
            "these", "they", "this", "to", "was", "will", "with", "", "www" };

    // ~ Instance fields --------------------------------------------------------

    /** stop word set */
    private final CharArraySet stopSet;

    /**
     * Builds an analyzer which removes the words in STOP_WORDS.
     */
    public FMMAnalyzer() {
        this(STOP_WORDS);
    }

    /**
     * Builds an analyzer which removes the words in the provided array.
     *
     * @param stopWords stop word array
     */
    public FMMAnalyzer(String[] stopWords) {
        stopSet = StopFilter.makeStopSet(Version.LUCENE_42, stopWords);
    }

    public static void initDict() {
        if (dictManager == null) {
            dictManager = new DictManager();
        }
    }

    // ~ Methods ----------------------------------------------------------------

    /**
     * Build the token stream components for the given field.
     *
     * @param fieldName lucene field name
     * @param reader    input reader
     */
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        final Tokenizer source = new DefaultTokenizer(Version.LUCENE_42, reader);
        // In Lucene 4.2, StandardFilter takes a Version argument.
        TokenStream filter = new StandardFilter(Version.LUCENE_42, source);
        // Apply the stop set built in the constructor (the original listing
        // built stopSet but never used it).
        filter = new StopFilter(Version.LUCENE_42, filter, stopSet);
        // Do NOT close the Tokenizer here; Lucene manages the stream lifecycle.
        return new TokenStreamComponents(source, filter);
    }
}
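The name FMMAnalyzer suggests forward maximum matching (正向最大匹配) segmentation. The DefaultTokenizer source is not shown in this post, so the following is only a hypothetical standalone sketch of that strategy: at each position, take the longest dictionary word that matches, falling back to a single character.

```java
import java.util.*;

// Forward-maximum-matching segmentation sketch. Assumption: this is what the
// "FMM" in FMMAnalyzer refers to; DefaultTokenizer's actual logic is not shown.
public class FmmSegmenter {
    private final Set<String> dict;
    private final int maxLen;

    public FmmSegmenter(Set<String> dict) {
        this.dict = dict;
        int m = 1;
        for (String w : dict) m = Math.max(m, w.length());
        this.maxLen = m; // longest dictionary entry bounds the match window
    }

    public List<String> segment(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            // Shrink the window until it matches a dictionary word,
            // or falls back to a single character.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }
}
```

With a dictionary containing "欢迎" and "北京", segmenting "欢迎来北京" produces ["欢迎", "来", "北京"]: the unknown character "来" falls back to a single-character token.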

The static initializer first calls initDict() to create the segmentation dictionary manager:



import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

/**
 * Segmentation dictionary manager.
 *
 * @author lvzf 305036301@qq.com
 */
public class DictManager {
    /**
     * Locations of the dictionary files.
     */
    private String _dictFiles = "dict/snouse.txt,dict/snumbers.txt,dict/splace.txt,dict/susername.txt,dict/sword.txt,dict/sconj.txt";

    /**
     * Ordinary words.
     */
    private PhraseDict wordMap;

    /**
     * Place names.
     */
    private PhraseDict placeMap;

    /**
     * Person names.
     */
    private PhraseDict usernameMap;

    /**
     * Useless (stop) words.
     */
    private PhraseDict nouseMap;

    /**
     * Numbers.
     */
    private PhraseDict numberMap;

    /**
     * Conjunctions.
     */
    private PhraseDict conjMap;

    /**
     * Logger.
     */
    private static final Log log = LogFactory.getLog(DictManager.class);

    public DictManager() {
        // Route each dictionary file to its map by a keyword in its file name.
        String[] files = getDictFiles().split(",");
        for (String file : files) {
            if (file.indexOf("word") > -1) {
                wordMap = parseDict(ClassLoaderUtil.getStream(file));
            } else if (file.indexOf("place") > -1) {
                placeMap = parseDict(ClassLoaderUtil.getStream(file));
            } else if (file.indexOf("username") > -1) {
                usernameMap = parseDict(ClassLoaderUtil.getStream(file));
            } else if (file.indexOf("nouse") > -1) {
                nouseMap = parseDict(ClassLoaderUtil.getStream(file));
            } else if (file.indexOf("number") > -1) {
                numberMap = parseDict(ClassLoaderUtil.getStream(file));
            } else if (file.indexOf("conj") > -1) {
                conjMap = parseDict(ClassLoaderUtil.getStream(file));
            }
        }
    }

    /**
     * Parse one dictionary file and store its entries in a PhraseDict.
     */
    private PhraseDict parseDict(InputStream is) {
        if (is == null) {
            return null;
        }
        PhraseDict dictMap = new PhraseDict();

        InputStreamReader fr = null;
        BufferedReader br = null;
        try {
            fr = new InputStreamReader(is, "UTF-8");
            br = new BufferedReader(fr);
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith("#") || line.equals("")) {
                    continue; // skip comments and blank lines
                }
                // Index each entry under its first character:
                // e.g. "欢迎" is stored in the bucket keyed by '欢'.
                dictMap.put(line.charAt(0), line);
            }
        } catch (IOException e) {
            log.error("Failed to read dictionary", e);
        } finally {
            // Close the outermost stream first, the raw InputStream last.
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    if (log.isDebugEnabled()) {
                        log.debug("Exception while closing BufferedReader", e);
                    }
                }
            }
            if (fr != null) {
                try {
                    fr.close();
                } catch (IOException e) {
                    if (log.isDebugEnabled()) {
                        log.debug("Exception while closing InputStreamReader", e);
                    }
                }
            }
            if (is != null) {
                try {
                    is.close();
                } catch (IOException e) {
                    if (log.isDebugEnabled()) {
                        log.debug("Exception while closing InputStream", e);
                    }
                }
            }
        }

        return dictMap;
    }

    /**
     * @return the dictFiles.
     */
    public final String getDictFiles() {
        return _dictFiles;
    }

    /**
     * @param dictFiles the dictFiles to set.
     */
    public final void setDictFiles(String dictFiles) {
        _dictFiles = dictFiles;
    }

    /**
     * @return the nouseMap.
     */
    public final PhraseDict getNouseMap() {
        return nouseMap;
    }

    /**
     * @return the numberMap.
     */
    public final PhraseDict getNumberMap() {
        return numberMap;
    }

    /**
     * @return the placeMap.
     */
    public final PhraseDict getPlaceMap() {
        return placeMap;
    }

    /**
     * @return the usernameMap.
     */
    public final PhraseDict getUsernameMap() {
        return usernameMap;
    }

    /**
     * @return the wordMap.
     */
    public final PhraseDict getWordMap() {
        return wordMap;
    }

    /**
     * @return the conjMap.
     */
    public final PhraseDict getConjMap() {
        return conjMap;
    }
}
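PhraseDict's implementation is not shown in this post. From dictMap.put(line.charAt(0), line), it appears to bucket entries by their first character, so that segmentation only needs to scan candidates sharing the current character. A hypothetical sketch of that structure:

```java
import java.util.*;

// Hypothetical sketch of the PhraseDict idea (its real source is not shown):
// entries are bucketed by their first character, so lookup only scans words
// that start with that character.
public class PhraseDictSketch {
    private final Map<Character, List<String>> buckets = new HashMap<>();

    public void put(char first, String word) {
        buckets.computeIfAbsent(first, k -> new ArrayList<>()).add(word);
    }

    /** All dictionary entries starting with the given character. */
    public List<String> candidates(char first) {
        return buckets.getOrDefault(first, Collections.emptyList());
    }
}
```

During segmentation, a tokenizer positioned at some character c would call candidates(c) and try to extend the match using only that short list, rather than probing the whole dictionary.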






