Category: BSD

2006-04-17 23:23:53

Lucene is actually a person's name: Lucene is Doug Cutting's wife's middle name, and also her maternal grandmother's first name.
I read Che Dong's blog post about a parser for MS Word documents: unlike ASCII-based RTF files, Word documents are binary, so that parser went through the COM object mechanism. In fact, Apache POI is perfectly capable of extracting text from MS Word documents. I adapted someone else's example below as a starting point, so please go easy on me.
Lucene does not dictate the format of the data source; it only provides a generic structure, the Document object, to receive input for indexing. As far as I can tell, though, that input can only be text.
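For example, text extracted from any source is simply wrapped in a Document. A minimal sketch using the Lucene 1.4-era Field factories that the code below also uses (extractedText is a placeholder for the text pulled out of the source file):

// Whatever the original format, Lucene only ever sees plain text.
String extractedText = "...";  // placeholder: text extracted from the source file
Document doc = new Document();
doc.add(Field.Keyword("filename", "d:/testdoc/msword/sample.doc"));  // stored verbatim
doc.add(Field.UnStored("body", extractedText));                      // indexed, not stored
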
package org.tatan.framework;

import java.io.PrintStream;
import java.io.PrintWriter;

public class DocumentHandlerException extends Exception {
  private Throwable cause;

  /**
   * Default constructor.
   */
  public DocumentHandlerException() {
    super();
  }

  /**
   * Constructs with message.
   */
  public DocumentHandlerException(String message) {
    super(message);
  }

  /**
   * Constructs with chained exception.
   */
  public DocumentHandlerException(Throwable cause) {
    super(cause.toString());
    this.cause = cause;
  }

  /**
   * Constructs with message and exception.
   */
  public DocumentHandlerException(String message, Throwable cause) {
    super(message);
    // Keep a reference to the nested exception so getException()
    // and the printStackTrace overrides below can report it.
    this.cause = cause;
  }

  /**
   * Retrieves nested exception.
   */
  public Throwable getException() {
    return cause;
  }

  public void printStackTrace() {
    printStackTrace(System.err);
  }

  public void printStackTrace(PrintStream ps) {
    synchronized (ps) {
      super.printStackTrace(ps);
      if (cause != null) {
        ps.println("--- Nested Exception ---");
        cause.printStackTrace(ps);
      }
    }
  }

  public void printStackTrace(PrintWriter pw) {
    synchronized (pw) {
      super.printStackTrace(pw);
      if (cause != null) {
        pw.println("--- Nested Exception ---");
        cause.printStackTrace(pw);
      }
    }
  }
}
The class that parses MS Word documents:
package org.tatan.framework;
import org.apache.poi.hdf.extractor.WordDocument;
import java.io.InputStream;
import java.io.StringWriter;
import java.io.PrintWriter;

public class POIWordDocHandler {

  public String getDocument(InputStream is)
    throws DocumentHandlerException {

    String bodyText = null;

    try {
      WordDocument wd = new WordDocument(is);
      StringWriter docTextWriter = new StringWriter();
      wd.writeAllText(new PrintWriter(docTextWriter));
      docTextWriter.close();
      bodyText = docTextWriter.toString();
    }
    catch (Exception e) {
      throw new DocumentHandlerException(
        "Cannot extract text from a Word document", e);
    }

    if ((bodyText != null) && (bodyText.trim().length() > 0)) {
      return bodyText;
    }
    return null;
  }
}
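
Used on its own, the handler just takes a stream over a .doc file and returns plain text. A quick usage sketch (the path is hypothetical):

POIWordDocHandler handler = new POIWordDocHandler();
String text = handler.getDocument(
    new java.io.FileInputStream("d:/testdoc/msword/sample.doc"));
if (text != null) {
  System.out.println(text);
}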

The class that builds the index:
package org.tatan.framework;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import java.util.Date;


public class Indexer {

  public static void main(String[] args) throws Exception {
    
    File indexDir = new File("d:/testdoc/index");
    File dataDir = new File("d:/testdoc/msword");

    long start = new Date().getTime();
    int numIndexed = index(indexDir, dataDir);
    long end = new Date().getTime();

    System.out.println("Indexing " + numIndexed + " files took "
      + (end - start) + " milliseconds");
  }

  public static int index(File indexDir, File dataDir)
    throws Exception {

    if (!dataDir.exists() || !dataDir.isDirectory()) {
      throw new IOException(dataDir
        + " does not exist or is not a directory");
    }

    IndexWriter writer = new IndexWriter(indexDir,
      new CJKAnalyzer(), true);
    writer.setUseCompoundFile(false);

    indexDirectory(writer, dataDir);

    int numIndexed = writer.docCount();
    writer.optimize();
    writer.close();
    return numIndexed;
  }

  private static void indexDirectory(IndexWriter writer, File dir)
    throws Exception {

    File[] files = dir.listFiles();

    for (int i = 0; i < files.length; i++) {
      File f = files[i];
      if (f.isDirectory()) {
        indexDirectory(writer, f);  // recurse
      } else if (f.getName().endsWith(".doc")) {
        indexFile(writer, f);
      }
    }
  }

  private static void indexFile(IndexWriter writer, File f)
    throws Exception {

    if (f.isHidden() || !f.exists() || !f.canRead()) {
      return;
    }

    System.out.println("Indexing " + f.getCanonicalPath());

    POIWordDocHandler handler = new POIWordDocHandler();
    String bodyText = handler.getDocument(new FileInputStream(f));
    if (bodyText == null) {
      return;  // nothing extractable from this file; skip it
    }

    Document doc = new Document();
    doc.add(Field.UnStored("body", bodyText));
    doc.add(Field.Keyword("filename", f.getCanonicalPath()));
    writer.addDocument(doc);
  }
}

A point to note: the body uses the Field.UnStored factory, which tokenizes and indexes the text for full-text search but does not store it in the index.
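For comparison, these are the other common Field factories in the Lucene 1.4-era API (a sketch, not part of the original example; path, title, text and original are placeholders):

doc.add(Field.Keyword("filename", path));   // indexed as one untokenized term, and stored
doc.add(Field.Text("title", title));        // tokenized, indexed, and stored
doc.add(Field.UnStored("body", text));      // tokenized and indexed, but not stored
doc.add(Field.UnIndexed("raw", original));  // stored only, not searchable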
The class that searches the index:
package org.tatan.framework;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;


public class Searcher {
  public static void main(String[] args) throws Exception {

    Directory fsDir = FSDirectory.getDirectory("D:\\testdoc\\index", false);
    IndexSearcher is = new IndexSearcher(fsDir);

    // Split the query string with the same analyzer used at index time,
    // then run one query per resulting token.
    Token[] tokens = AnalyzerUtils.tokensFromAnalysis(new CJKAnalyzer(), "一人一情");
    for (int i = 0; i < tokens.length; i++) {
      Query query = QueryParser.parse(tokens[i].termText(), "body", new CJKAnalyzer());
      Hits hits = is.search(query);

      for (int j = 0; j < hits.length(); j++) {
        Document doc = hits.doc(j);
        System.out.println(doc.get("filename"));
      }
    }
  }
}
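Note that the Searcher relies on AnalyzerUtils.tokensFromAnalysis, which this post does not define; it is modeled on the helper from the "Lucene in Action" example code. A minimal sketch against the Lucene 1.4-era TokenStream API:

package org.tatan.framework;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;

public class AnalyzerUtils {

  // Runs the text through the analyzer and collects the resulting tokens.
  public static Token[] tokensFromAnalysis(Analyzer analyzer, String text)
      throws IOException {
    TokenStream stream =
      analyzer.tokenStream("contents", new StringReader(text));
    ArrayList tokenList = new ArrayList();
    Token token;
    while ((token = stream.next()) != null) {
      tokenList.add(token);
    }
    return (Token[]) tokenList.toArray(new Token[tokenList.size()]);
  }
}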
A point to note: do not use TermQuery here; a raw TermQuery cannot match the Chinese text, because CJKAnalyzer indexed it as segmented (bigram) terms rather than whole words. There is no true Chinese word segmentation here yet, so the query string has to be run through the same analyzer.
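To see the difference, a sketch (TermQuery and Term would additionally need imports from org.apache.lucene.search and org.apache.lucene.index):

// A raw TermQuery looks up the literal string as one term -- but CJKAnalyzer
// indexed the text as overlapping two-character tokens, so this finds nothing:
Query raw = new TermQuery(new Term("body", "一人一情"));

// Running the same string through the analyzer produces matching bigram terms:
Query analyzed = QueryParser.parse("一人一情", "body", new CJKAnalyzer());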