Lucene-2.3.1 Source Code Reading Notes (31) - linxh - ChinaUnix Blog


2008-06-23 20:27:34

Reposted from: http://daihaixiang.blog.163.com/blog/static/383013420084121155910/

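This post covers prefix queries with PrefixQuery.

The preparation step is to build an index for the chosen source files. Here I used the ThesaurusAnalyzer analyzer, which ships with its own dictionary; this tokenization component can be downloaded from the web.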

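PrefixQuery simply specifies a prefix for terms. For example, many terms start with the prefix "文件" (file): 文件系统 (file system), 文件管理 (file management), 文件类型 (file type), and so on. However, when you search for terms built from a given prefix (the bare prefix on its own also counts as a term), the index must actually contain such terms. In other words, tokenization at index time must have produced terms that start with this prefix; otherwise nothing will be found.

In Lucene, when you specify a prefix, the search also treats the prefix itself as a term. Take the prefix "文件": if the dictionary contains the term "文件", and some file consists of a single sentence, "我们要安全地管理好自己的文件。" ("We must manage our own files safely."), PrefixQuery can still retrieve that file.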

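Of course, BooleanQuery can combine several query clauses, which may be TermQuery clauses or PrefixQuery clauses, to build complex queries.

Let's start with a simple example that uses PrefixQuery.

The test main function is shown below: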

package org.apache.lucene.shirdrn.main;

import java.io.IOException;
import java.util.Date;
import java.util.Iterator;

import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

public class PrefixQuerySearcher {

public static void main(String[] args) {

   String indexPath = "E:\\Lucene\\myindex";
   try {
    IndexSearcher searcher = new IndexSearcher(indexPath);

    String keywordPrefix = "文件";    //   use "文件" (file) as the prefix
    Term prefixTerm = new Term("contents",keywordPrefix);
    Query prefixQuery = new PrefixQuery(prefixTerm);
    Date startTime = new Date();
    Hits hits = searcher.search(prefixQuery);
    Iterator it = hits.iterator();
    System.out.println("********************************************************************");
    while(it.hasNext()){
     Hit hit = (Hit)it.next();
     System.out.println("Hit ID: "+hit.getId());
     System.out.println("Hit score: "+hit.getScore());
     System.out.println("Hit boost: "+hit.getBoost());
     System.out.println("Hit toString: "+hit.toString());
     System.out.println("Hit Document: "+hit.getDocument());
     System.out.println("Hit Document Fields: "+hit.getDocument().getFields());
     // print the name and stored value of every field of the matching document
     for(int i=0;i<hit.getDocument().getFields().size();i++){
      Field field = (Field)hit.getDocument().getFields().get(i);
      System.out.println("      -------------------------------------------------------------");
      System.out.println("      Field name: "+field.name());
      System.out.println("      Field stringValue: "+field.stringValue());
     }
     System.out.println("********************************************************************");
    }
    System.out.println("Number of Hits matching the prefix: "+hits.length());
    Date finishTime = new Date();
    long timeOfSearch = finishTime.getTime() - startTime.getTime();
    System.out.println("Search took "+timeOfSearch+" ms");
   } catch (CorruptIndexException e) {
    e.printStackTrace();
   } catch (IOException e) {
    e.printStackTrace();
   }
}
}

The test output is shown below:

********************************************************************
Hit ID: 41
Hit score: 0.3409751
Hit boost: 1.0
Hit toString: Hit< [0] resolved>
Hit Document: Document stored/uncompressed,indexed>
Hit Document Fields: [stored/uncompressed,indexed, stored/uncompressed,indexed]
      -------------------------------------------------------------
      Field name: path
      Field stringValue: E:\Lucene\txt1\mytxt\Update.txt
      -------------------------------------------------------------
      Field name: modified
      Field stringValue: 200707050028
********************************************************************
Hit ID: 46
Hit score: 0.3043366
Hit boost: 1.0
Hit toString: Hit< [1] resolved>
Hit Document: Document stored/uncompressed,indexed>
Hit Document Fields: [stored/uncompressed,indexed, stored/uncompressed,indexed]
      -------------------------------------------------------------
      Field name: path
      Field stringValue: E:\Lucene\txt1\mytxt\使用技巧集萃.txt
      -------------------------------------------------------------
      Field name: modified
      Field stringValue: 200511210413
********************************************************************
Hit ID: 24
Hit score: 0.25827435
Hit boost: 1.0
Hit toString: Hit< [2] resolved>
Hit Document: Document stored/uncompressed,indexed>
Hit Document Fields: [stored/uncompressed,indexed, stored/uncompressed,indexed]
      -------------------------------------------------------------
      Field name: path
      Field stringValue: E:\Lucene\txt1\mytxt\FAQ.txt
      -------------------------------------------------------------
      Field name: modified
      Field stringValue: 200604130754
********************************************************************
Hit ID: 44
Hit score: 0.23094007
Hit boost: 1.0
Hit toString: Hit< [3] resolved>
Hit Document: Document stored/uncompressed,indexed>
Hit Document Fields: [stored/uncompressed,indexed, stored/uncompressed,indexed]
      -------------------------------------------------------------
      Field name: path
      Field stringValue: E:\Lucene\txt1\mytxt\Visual Studio 2005注册升级.txt
      -------------------------------------------------------------
      Field name: modified
      Field stringValue: 200801300512
********************************************************************
Hit ID: 57
Hit score: 0.16743648
Hit boost: 1.0
Hit toString: Hit< [4] resolved>
Hit Document: Document stored/uncompressed,indexed>
Hit Document Fields: [stored/uncompressed,indexed, stored/uncompressed,indexed]
      -------------------------------------------------------------
      Field name: path
      Field stringValue: E:\Lucene\txt1\mytxt\新建文本文档.txt
      -------------------------------------------------------------
      Field name: modified
      Field stringValue: 200710270258
********************************************************************
Hit ID: 12
Hit score: 0.14527147
Hit boost: 1.0
Hit toString: Hit< [5] resolved>
Hit Document: Document stored/uncompressed,indexed>
Hit Document Fields: [stored/uncompressed,indexed, stored/uncompressed,indexed]
      -------------------------------------------------------------
      Field name: path
      Field stringValue: E:\Lucene\txt1\mytxt\CustomKeyInfo.txt
      -------------------------------------------------------------
      Field name: modified
      Field stringValue: 200406041814
********************************************************************
Hit ID: 63
Hit score: 0.091877736
Hit boost: 1.0
Hit toString: Hit< [6] resolved>
Hit Document: Document stored/uncompressed,indexed>
Hit Document Fields: [stored/uncompressed,indexed, stored/uncompressed,indexed]
      -------------------------------------------------------------
      Field name: path
      Field stringValue: E:\Lucene\txt1\疑问即时记录.txt
      -------------------------------------------------------------
      Field name: modified
      Field stringValue: 200711141408
********************************************************************
Hit ID: 59
Hit score: 0.08039302
Hit boost: 1.0
Hit toString: Hit< [7] resolved>
Hit Document: Document stored/uncompressed,indexed>
Hit Document Fields: [stored/uncompressed,indexed, stored/uncompressed,indexed]
      -------------------------------------------------------------
      Field name: path
      Field stringValue: E:\Lucene\txt1\mytxt\汉化说明.txt
      -------------------------------------------------------------
      Field name: modified
      Field stringValue: 200708210247
********************************************************************
Hit ID: 14
Hit score: 0.020302303
Hit boost: 1.0
Hit toString: Hit< [8] resolved>
Hit Document: Document stored/uncompressed,indexed>
Hit Document Fields: [stored/uncompressed,indexed, stored/uncompressed,indexed]
      -------------------------------------------------------------
      Field name: path
      Field stringValue: E:\Lucene\txt1\mytxt\CustomKeysSample.txt
      -------------------------------------------------------------
      Field name: modified
      Field stringValue: 200610100451
********************************************************************
Number of Hits matching the prefix: 9
Search took 297 ms

As you can see, the query for the prefix "文件" returned 9 matching results.

For the prefix "文件", the dictionary of the ThesaurusAnalyzer tokenization component contains the following terms:

文件
文件匯編
文件名
文件夹
文件夾
文件尾
文件汇编
文件精神

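Suppose we have the following requirement: retrieve results for terms that have "文件" as a prefix, but exclude results in which "文件" appears on its own as a term.

In that case, add a TermQuery clause and combine it with the prefix query using BooleanQuery.

On top of the test main function above, add the following code: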

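    // Needs these additional imports from org.apache.lucene.search:
    // TermQuery, BooleanQuery, BooleanClause
    String keyword = "文件";
    Term term = new Term("contents",keyword);
    Query tQuery = new TermQuery(term);    // matches the exact term "文件"

    BooleanQuery bQuery = new BooleanQuery();
    bQuery.add(tQuery,BooleanClause.Occur.MUST_NOT);    // exclude documents that contain the exact term
    bQuery.add(prefixQuery,BooleanClause.Occur.MUST);   // require a term that starts with the prefix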

Change Hits hits = searcher.search(prefixQuery); to:

Hits hits = searcher.search(bQuery);

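Because results with "文件" as a standalone term should be excluded, the MUST_NOT (logical NOT) operator is used. Note that MUST_NOT drops every document whose contents field contains the exact term "文件", even if that document also contains a longer term such as "文件夹".

After running the query, only one result matches, as shown below: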

********************************************************************
Hit ID: 44
Hit score: 0.23393866
Hit boost: 1.0
Hit toString: Hit< [0] resolved>
Hit Document: Document stored/uncompressed,indexed>
Hit Document Fields: [stored/uncompressed,indexed, stored/uncompressed,indexed]
      -------------------------------------------------------------
      Field name: path
      Field stringValue: E:\Lucene\txt1\mytxt\Visual Studio 2005注册升级.txt
      -------------------------------------------------------------
      Field name: modified
      Field stringValue: 200801300512
********************************************************************
Number of Hits matching the prefix: 1
Search took 187 ms
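
The way MUST and MUST_NOT interact here can be sketched with plain Java collections. This is only an illustration under stated assumptions, not Lucene code: the toy index, the document names, and the English terms ("file", "filename", "filefolder" standing in for "文件" and its longer forms) are all hypothetical. It shows that a document containing both the bare term and a longer prefixed term is still excluded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BooleanClauseSketch {

    // Hypothetical toy index: document name -> terms of its "contents" field.
    static Map<String, Set<String>> index() {
        Map<String, Set<String>> docs = new LinkedHashMap<>();
        docs.put("a.txt", new HashSet<>(Arrays.asList("file", "manage")));
        docs.put("b.txt", new HashSet<>(Arrays.asList("filefolder", "register")));
        docs.put("c.txt", new HashSet<>(Arrays.asList("file", "filename")));
        docs.put("d.txt", new HashSet<>(Arrays.asList("upgrade")));
        return docs;
    }

    // Keeps documents that have some term starting with the prefix (MUST)
    // and that do not contain the excluded exact term (MUST_NOT).
    static List<String> search(Map<String, Set<String>> docs, String prefix, String excluded) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : docs.entrySet()) {
            boolean hasPrefixedTerm = false;
            for (String term : entry.getValue()) {
                if (term.startsWith(prefix)) {
                    hasPrefixedTerm = true;
                    break;
                }
            }
            boolean hasExcludedTerm = entry.getValue().contains(excluded);
            if (hasPrefixedTerm && !hasExcludedTerm) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // c.txt contains "filename" but also the bare term "file", so it is dropped.
        System.out.println(search(index(), "file", "file")); // prints [b.txt]
    }
}
```

In this toy data three documents match the prefix, but only b.txt survives the exclusion, which mirrors how the nine prefix hits above shrink to a single hit.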

Now let's look at the source code of PrefixQuery. PrefixQuery provides only one constructor:

private Term prefix;
public PrefixQuery(Term prefix) {
this.prefix = prefix;
}

It is constructed from a single Term argument, which is very easy to grasp.

PrefixQuery has an important rewrite() method:

public Query rewrite(IndexReader reader) throws IOException {
    BooleanQuery query = new BooleanQuery(true);
    TermEnum enumerator = reader.terms(prefix);
    try {
      String prefixText = prefix.text();
      String prefixField = prefix.field();
      do {
        Term term = enumerator.term();
        if (term != null &&
            term.text().startsWith(prefixText) &&
            term.field() == prefixField)
        {
          TermQuery tq = new TermQuery(term);
          tq.setBoost(getBoost());
          query.add(tq, BooleanClause.Occur.SHOULD);      // builds up a BooleanQuery; the clauses added to it are combined with logical OR
        } else {
          break;
        }
      } while (enumerator.next());
    } finally {
      enumerator.close();
    }
    return query;
}

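Given an IndexReader, this method calls the reader's terms() method to position a term enumeration at the first term at or after the prefix term, then walks forward over every term in the same field that starts with the prefix. It builds one TermQuery clause per such term, adds the clauses to a BooleanQuery, and returns that BooleanQuery as the new Query. The clauses in this BooleanQuery are related by logical OR, and it is this multi-clause BooleanQuery that finally carries out the search.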

In effect, multiple TermQuery instances are executed and their result sets are combined with SHOULD (logical OR).
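
The enumeration loop in rewrite() can be mimicked with a sorted set from the Java standard library. The following is a minimal sketch, assuming a single field and a hypothetical English term dictionary; the class and method names are made up for illustration, and a TreeSet stands in for the sorted term dictionary that TermEnum walks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class PrefixRewriteSketch {

    // Collects every term that starts with the prefix. Like rewrite(), it seeks
    // to the first term >= prefix and stops at the first term that does not match,
    // which is safe because matching terms are contiguous in sorted order.
    static List<String> expand(NavigableSet<String> sortedTerms, String prefix) {
        List<String> clauses = new ArrayList<>();
        for (String term : sortedTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            clauses.add(term); // in Lucene this would become one SHOULD TermQuery clause
        }
        return clauses;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(
                Arrays.asList("fig", "file", "filename", "filesystem", "fill", "zip"));
        System.out.println(expand(terms, "file")); // prints [file, filename, filesystem]
    }
}
```

The early break is what keeps the walk cheap: "fill" sorts right after the last matching term, so the loop never looks at "zip".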

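In Lucene, the default maximum number of clauses in a BooleanQuery is 1024; exceeding this limit throws an exception. The main idea of PrefixQuery is to add many TermQuery clauses, combined with SHOULD, to a BooleanQuery, and this seems to raise an efficiency concern. Every clause has to be executed, which works well when there are only a few clauses. But if 1,000,000 or more TermQuery clauses were added to the BooleanQuery, the outlook would not be good: the default clause limit would have to be raised, and performance would likely suffer.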
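
The clause limit can be illustrated with a small stand-alone sketch. It assumes a hypothetical countClauses helper and synthetic terms, and it throws IllegalStateException as a stand-in for Lucene's BooleanQuery.TooManyClauses; it is not the library's implementation.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

public class ClauseLimitSketch {

    // Mirrors the default limit described above.
    static final int DEFAULT_MAX_CLAUSES = 1024;

    // Counts the clauses a prefix would expand to, failing once the limit is exceeded.
    static int countClauses(NavigableSet<String> sortedTerms, String prefix, int maxClauses) {
        int count = 0;
        for (String term : sortedTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            count++;
            if (count > maxClauses) {
                // stand-in for BooleanQuery.TooManyClauses
                throw new IllegalStateException("too many clauses: limit is " + maxClauses);
            }
        }
        return count;
    }

    // Builds n synthetic terms that all share the prefix (hypothetical test data).
    static NavigableSet<String> syntheticTerms(String prefix, int n) {
        NavigableSet<String> terms = new TreeSet<>();
        for (int i = 0; i < n; i++) {
            terms.add(prefix + String.format("%05d", i));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(countClauses(syntheticTerms("file", 1000), "file", DEFAULT_MAX_CLAUSES));
        try {
            countClauses(syntheticTerms("file", 2000), "file", DEFAULT_MAX_CLAUSES);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In Lucene itself the limit can be changed with the static BooleanQuery.setMaxClauseCount(int) method, but as noted above, raising it only moves the problem: every expanded term still becomes a clause that must be executed.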
