Category: BSD

2006-04-17 23:23:53

Lucene is actually a person's name: Lucene is Doug Cutting's wife's middle name, and also her maternal grandmother's first name.
I read Che Dong's blog post about a parser for MS Word documents: unlike ASCII-based RTF files, Word documents are binary, so that parser went through the COM object mechanism. In fact, Apache POI is perfectly capable of extracting text from MS Word documents. I adapted someone else's example below as a starting point, so please go easy on me.
Lucene does not dictate the format of the data source; it only provides a generic structure, the Document object, to receive input for indexing. As far as I can tell, though, that input can only be text.
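For example, text extracted from any source is simply wrapped in a Document. A minimal sketch using the Lucene 1.4-era Field factories that the code below also uses (extractedText is a placeholder for the text pulled out of the source file):

// Whatever the original format, Lucene only ever sees plain text.
String extractedText = "...";  // placeholder: text extracted from the source file
Document doc = new Document();
doc.add(Field.Keyword("filename", "d:/testdoc/msword/sample.doc"));  // stored verbatim
doc.add(Field.UnStored("body", extractedText));                      // indexed, not stored
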
package org.tatan.framework;

import java.io.PrintStream;
import java.io.PrintWriter;

public class DocumentHandlerException extends Exception {
  private Throwable cause;

  /**
   * Default constructor.
   */
  public DocumentHandlerException() {
    super();
  }

  /**
   * Constructs with message.
   */
  public DocumentHandlerException(String message) {
    super(message);
  }

  /**
   * Constructs with chained exception.
   */
  public DocumentHandlerException(Throwable cause) {
    super(cause.toString());
    this.cause = cause;
  }

  /**
   * Constructs with message and exception.
   */
  public DocumentHandlerException(String message, Throwable cause) {
    super(message);
    // Keep a reference to the nested exception so getException()
    // and the printStackTrace overrides below can report it.
    this.cause = cause;
  }

  /**
   * Retrieves nested exception.
   */
  public Throwable getException() {
    return cause;
  }

  public void printStackTrace() {
    printStackTrace(System.err);
  }

  public void printStackTrace(PrintStream ps) {
    synchronized (ps) {
      super.printStackTrace(ps);
      if (cause != null) {
        ps.println("--- Nested Exception ---");
        cause.printStackTrace(ps);
      }
    }
  }

  public void printStackTrace(PrintWriter pw) {
    synchronized (pw) {
      super.printStackTrace(pw);
      if (cause != null) {
        pw.println("--- Nested Exception ---");
        cause.printStackTrace(pw);
      }
    }
  }
}
The class that parses MS Word documents:
package org.tatan.framework;
import org.apache.poi.hdf.extractor.WordDocument;
import java.io.InputStream;
import java.io.StringWriter;
import java.io.PrintWriter;

public class POIWordDocHandler {

  public String getDocument(InputStream is)
    throws DocumentHandlerException {

    String bodyText = null;

    try {
      WordDocument wd = new WordDocument(is);
      StringWriter docTextWriter = new StringWriter();
      wd.writeAllText(new PrintWriter(docTextWriter));
      docTextWriter.close();
      bodyText = docTextWriter.toString();
    }
    catch (Exception e) {
      throw new DocumentHandlerException(
        "Cannot extract text from a Word document", e);
    }

    if ((bodyText != null) && (bodyText.trim().length() > 0)) {
      return bodyText;
    }
    return null;
  }
}
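
Used on its own, the handler just takes a stream over a .doc file and returns plain text. A quick usage sketch (the path is hypothetical):

POIWordDocHandler handler = new POIWordDocHandler();
String text = handler.getDocument(
    new java.io.FileInputStream("d:/testdoc/msword/sample.doc"));
if (text != null) {
  System.out.println(text);
}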

The class that builds the index:
package org.tatan.framework;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import java.util.Date;


public class Indexer {

  public static void main(String[] args) throws Exception {
    
    File indexDir = new File("d:/testdoc/index");
    File dataDir = new File("d:/testdoc/msword");

    long start = new Date().getTime();
    int numIndexed = index(indexDir, dataDir);
    long end = new Date().getTime();

    System.out.println("Indexing " + numIndexed + " files took "
      + (end - start) + " milliseconds");
  }

  public static int index(File indexDir, File dataDir)
    throws Exception {

    if (!dataDir.exists() || !dataDir.isDirectory()) {
      throw new IOException(dataDir
        + " does not exist or is not a directory");
    }

    IndexWriter writer = new IndexWriter(indexDir,
      new CJKAnalyzer(), true);
    writer.setUseCompoundFile(false);

    indexDirectory(writer, dataDir);

    int numIndexed = writer.docCount();
    writer.optimize();
    writer.close();
    return numIndexed;
  }

  private static void indexDirectory(IndexWriter writer, File dir)
    throws Exception {

    File[] files = dir.listFiles();

    for (int i = 0; i < files.length; i++) {
      File f = files[i];
      if (f.isDirectory()) {
        indexDirectory(writer, f);  // recurse
      } else if (f.getName().endsWith(".doc")) {
        indexFile(writer, f);
      }
    }
  }

  private static void indexFile(IndexWriter writer, File f)
    throws Exception {

    if (f.isHidden() || !f.exists() || !f.canRead()) {
      return;
    }

    System.out.println("Indexing " + f.getCanonicalPath());

    POIWordDocHandler handler = new POIWordDocHandler();
    String bodyText = handler.getDocument(new FileInputStream(f));
    if (bodyText == null) {
      return;  // nothing extractable from this file; skip it
    }

    Document doc = new Document();
    doc.add(Field.UnStored("body", bodyText));
    doc.add(Field.Keyword("filename", f.getCanonicalPath()));
    writer.addDocument(doc);
  }
}

A point to note: the body uses the Field.UnStored factory, which tokenizes and indexes the text for full-text search but does not store it in the index.
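For comparison, these are the other common Field factories in the Lucene 1.4-era API (a sketch, not part of the original example; path, title, text and original are placeholders):

doc.add(Field.Keyword("filename", path));   // indexed as one untokenized term, and stored
doc.add(Field.Text("title", title));        // tokenized, indexed, and stored
doc.add(Field.UnStored("body", text));      // tokenized and indexed, but not stored
doc.add(Field.UnIndexed("raw", original));  // stored only, not searchable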
The class that searches the index:
package org.tatan.framework;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;


public class Searcher {
  public static void main(String[] args) throws Exception {

    Directory fsDir = FSDirectory.getDirectory("D:\\testdoc\\index", false);
    IndexSearcher is = new IndexSearcher(fsDir);

    // Split the query string with the same analyzer used at index time,
    // then run one query per resulting token.
    Token[] tokens = AnalyzerUtils.tokensFromAnalysis(new CJKAnalyzer(), "一人一情");
    for (int i = 0; i < tokens.length; i++) {
      Query query = QueryParser.parse(tokens[i].termText(), "body", new CJKAnalyzer());
      Hits hits = is.search(query);

      for (int j = 0; j < hits.length(); j++) {
        Document doc = hits.doc(j);
        System.out.println(doc.get("filename"));
      }
    }
  }
}
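Note that the Searcher relies on AnalyzerUtils.tokensFromAnalysis, which this post does not define; it is modeled on the helper from the "Lucene in Action" example code. A minimal sketch against the Lucene 1.4-era TokenStream API:

package org.tatan.framework;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;

public class AnalyzerUtils {

  // Runs the text through the analyzer and collects the resulting tokens.
  public static Token[] tokensFromAnalysis(Analyzer analyzer, String text)
      throws IOException {
    TokenStream stream =
      analyzer.tokenStream("contents", new StringReader(text));
    ArrayList tokenList = new ArrayList();
    Token token;
    while ((token = stream.next()) != null) {
      tokenList.add(token);
    }
    return (Token[]) tokenList.toArray(new Token[tokenList.size()]);
  }
}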
A point to note: do not use TermQuery here; a raw TermQuery cannot match the Chinese text, because CJKAnalyzer indexed it as segmented (bigram) terms rather than whole words. There is no true Chinese word segmentation here yet, so the query string has to be run through the same analyzer.
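To see the difference, a sketch (TermQuery and Term would additionally need imports from org.apache.lucene.search and org.apache.lucene.index):

// A raw TermQuery looks up the literal string as one term -- but CJKAnalyzer
// indexed the text as overlapping two-character tokens, so this finds nothing:
Query raw = new TermQuery(new Term("body", "一人一情"));

// Running the same string through the analyzer produces matching bigram terms:
Query analyzed = QueryParser.parse("一人一情", "body", new CJKAnalyzer());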