Lucene-2.3.1 Source Code Reading Notes (34) - linxh - ChinaUnix Blog
2008-06-23 20:31:35

Reposted from: http://daihaixiang.blog.163.com/blog/static/383013420084121413523/

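About PhraseQuery.

A PhraseQuery combines several terms into one phrase and retrieves from the index the documents in which those terms occur together as that phrase, which by default means at consecutive positions and in the stated order.

For example, suppose the user enters the keyword “网络安全” (network security). If the index has no single term “网络安全” but does contain the two terms “网络” and “安全”, we can query with a PhraseQuery: adding “网络” and “安全” in that order retrieves the documents in which the two terms appear next to each other, that is, the documents that match “网络安全”.

Here the target data is indexed with StandardAnalyzer, which means that every single Chinese character is stored in the index files as a separate term. Indexing may therefore take comparatively long, because the Tokenizer has to emit one token per character.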
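To make the single-character indexing concrete, here is a minimal sketch in plain Java (not Lucene code) that splits a string into one token per character and records each token's position. The class name `SingleCharTokens` and the method `tokenize` are hypothetical, and the sketch deliberately ignores how the real StandardAnalyzer handles Latin words, digits, and punctuation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: one token per non-whitespace character, with its position.
// This is an illustration, not the StandardAnalyzer implementation.
class SingleCharTokens {

    // Returns strings of the form "token@position", one per non-whitespace character.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        int position = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                continue; // whitespace produces no token and does not advance the position
            }
            tokens.add(c + "@" + position);
            position++;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "网络安全" becomes four single-character terms at positions 0..3
        System.out.println(tokenize("网络安全"));
    }
}
```

The recorded positions are what a phrase search relies on later: two terms form a phrase only if their positions line up.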

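The test program searches for “文件” (file). Because StandardAnalyzer was used, the index contains no term “文件”, so we build a PhraseQuery from the terms “文” and “件” to search for the keyword “文件”.

The test main function is shown below: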

package org.apache.lucene.shirdrn.main;

import java.io.IOException;
import java.util.Date;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;

public class PhraseQuerySearcher {

    public static void main(String[] args) {
        String path = "E:\\Lucene\\myindex";

        // One Term per character, both on the "contents" field
        String keywordA = "文";
        Term termA = new Term("contents", keywordA);

        String keywordB = "件";
        Term termB = new Term("contents", keywordB);

        // Add the two terms built from the search keyword to a PhraseQuery, in phrase order
        PhraseQuery phraseQuery = new PhraseQuery();
        phraseQuery.add(termA);
        phraseQuery.add(termB);

        try {
            Date startTime = new Date();
            IndexSearcher searcher = new IndexSearcher(path);
            Hits hits = searcher.search(phraseQuery);
            // Print id, score and stored fields of every hit
            for (int i = 0; i < hits.length(); i++) {
                System.out.println("Internal Document id: " + hits.id(i));
                Document doc = hits.doc(i);
                System.out.println("Document score: " + hits.score(i));
                List fieldList = doc.getFields();
                System.out.println("Fields of Document (id) " + hits.id(i) + ":");
                for (int j = 0; j < fieldList.size(); j++) {
                    Field field = (Field) fieldList.get(j);
                    System.out.println("    Field name : " + field.name());
                    System.out.println("    Field stringValue : " + field.stringValue());
                    System.out.println("    ------------------------------------");
                }
            }
            System.out.println("********************************************************************");
            System.out.println("Found " + hits.length() + " matching Document(s) in total.");
            Date finishTime = new Date();
            long timeOfSearch = finishTime.getTime() - startTime.getTime();
            System.out.println("This search took " + timeOfSearch + " ms");
            searcher.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The test output is as follows:

Internal Document id: 56
Document score: 1.0
Fields of Document (id) 56:
    Field name : path
    Field stringValue : E:\Lucene\txt1\mytxt\文件.txt
    ------------------------------------
    Field name : modified
    Field stringValue : 200804200649
    ------------------------------------
Internal Document id: 41
Document score: 0.57587546
Fields of Document (id) 41:
    Field name : path
    Field stringValue : E:\Lucene\txt1\mytxt\Update.txt
    ------------------------------------
    Field name : modified
    Field stringValue : 200707050028
    ------------------------------------
Internal Document id: 46
Document score: 0.5728219
Fields of Document (id) 46:
    Field name : path
    Field stringValue : E:\Lucene\txt1\mytxt\使用技巧集萃.txt
    ------------------------------------
    Field name : modified
    Field stringValue : 200511210413
    ------------------------------------
Internal Document id: 24
Document score: 0.45140085
Fields of Document (id) 24:
    Field name : path
    Field stringValue : E:\Lucene\txt1\mytxt\FAQ.txt
    ------------------------------------
    Field name : modified
    Field stringValue : 200604130754
    ------------------------------------
Internal Document id: 44
Document score: 0.4285714
Fields of Document (id) 44:
    Field name : path
    Field stringValue : E:\Lucene\txt1\mytxt\Visual Studio 2005注册升级.txt
    ------------------------------------
    Field name : modified
    Field stringValue : 200801300512
    ------------------------------------
Internal Document id: 12
Document score: 0.39528468
Fields of Document (id) 12:
    Field name : path
    Field stringValue : E:\Lucene\txt1\mytxt\CustomKeyInfo.txt
    ------------------------------------
    Field name : modified
    Field stringValue : 200406041814
    ------------------------------------
Internal Document id: 58
Document score: 0.33881545
Fields of Document (id) 58:
    Field name : path
    Field stringValue : E:\Lucene\txt1\mytxt\新建文本文档.txt
    ------------------------------------
    Field name : modified
    Field stringValue : 200710270258
    ------------------------------------
Internal Document id: 64
Document score: 0.28571427
Fields of Document (id) 64:
    Field name : path
    Field stringValue : E:\Lucene\txt1\疑问即时记录.txt
    ------------------------------------
    Field name : modified
    Field stringValue : 200711141408
    ------------------------------------
Internal Document id: 60
Document score: 0.17857142
Fields of Document (id) 60:
    Field name : path
    Field stringValue : E:\Lucene\txt1\mytxt\汉化说明.txt
    ------------------------------------
    Field name : modified
    Field stringValue : 200708210247
    ------------------------------------
Internal Document id: 14
Document score: 0.06313453
Fields of Document (id) 14:
    Field name : path
    Field stringValue : E:\Lucene\txt1\mytxt\CustomKeysSample.txt
    ------------------------------------
    Field name : modified
    Field stringValue : 200610100451
    ------------------------------------
********************************************************************
Found 10 matching Document(s) in total.
This search took 640 ms

So 10 Documents in total satisfy the query; that is, each of these 10 Documents contains text matching “文件”, and the match is of course in the contents Field.
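Why two single-character terms can stand in for the word “文件” can be sketched with a toy positional index: each term maps to the documents that contain it and to its positions there, and a phrase hit needs the second term exactly one position after the first. The class `ToyPhraseIndex`, its methods `addDocument` and `searchPhrase`, and the three document names are all hypothetical; this is an illustration of the idea, with no scoring, and not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy positional index: term -> (document name -> positions of the term).
// An illustration of the idea only; it does no scoring and is not Lucene code.
class ToyPhraseIndex {
    private final Map<Character, Map<String, List<Integer>>> postings =
            new HashMap<Character, Map<String, List<Integer>>>();

    // Index every character of the text as its own term.
    void addDocument(String name, String text) {
        for (int pos = 0; pos < text.length(); pos++) {
            char term = text.charAt(pos);
            Map<String, List<Integer>> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeMap<String, List<Integer>>();
                postings.put(term, docs);
            }
            List<Integer> positions = docs.get(name);
            if (positions == null) {
                positions = new ArrayList<Integer>();
                docs.put(name, positions);
            }
            positions.add(pos);
        }
    }

    // Documents in which `first` is immediately followed by `second`.
    List<String> searchPhrase(char first, char second) {
        List<String> result = new ArrayList<String>();
        Map<String, List<Integer>> firstDocs = postings.get(first);
        Map<String, List<Integer>> secondDocs = postings.get(second);
        if (firstDocs == null || secondDocs == null) {
            return result; // one of the terms never occurs
        }
        for (Map.Entry<String, List<Integer>> e : firstDocs.entrySet()) {
            List<Integer> secondPositions = secondDocs.get(e.getKey());
            if (secondPositions == null) {
                continue; // this document lacks the second term
            }
            for (int p : e.getValue()) {
                if (secondPositions.contains(p + 1)) {
                    result.add(e.getKey());
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ToyPhraseIndex index = new ToyPhraseIndex();
        index.addDocument("a.txt", "新建文本文件");
        index.addDocument("b.txt", "件文"); // both characters, but in the wrong order
        index.addDocument("c.txt", "文本"); // only the first character
        System.out.println(index.searchPhrase('文', '件'));
    }
}
```

Only a.txt qualifies in this toy data: containing both characters is not enough, they also have to be adjacent and in order.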

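PhraseQuery provides only one constructor:

public PhraseQuery() {}

It takes no parameters and has an empty body. To use the query, call PhraseQuery's add method to add the terms built from the keywords to the instance; this is what makes the more complex search possible.

There are two overloads of add. The one that takes a single Term appends the term to the end of the phrase, at the position immediately after the last term added. The other takes two parameters:

public void add(Term term, int position)

Here position sets the term's relative position within the phrase explicitly, so the phrase can contain gaps of an exact size. For example, suppose the user enters “天地” and StandardAnalyzer splits it into “天” and “地”. Adding “天” at position 0 and “地” at position 2 leaves exactly one empty slot between them, so target files containing phrases such as “惊天动地” or “天高地厚” are retrieved.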
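The effect of explicit relative positions can be sketched in plain Java. The class `PositionedPhrase` and its method `matches` are hypothetical names for a toy matcher that treats the text as one term per character and assumes non-negative offsets; it only illustrates the idea of placing “天” at offset 0 and “地” at offset 2, and is not how Lucene evaluates the query.

```java
// Toy matcher for a phrase whose terms carry explicit relative positions.
// Illustration only; the text is treated as one term per character.
class PositionedPhrase {

    // True if some start index s exists with text[s + offsets[i]] == terms[i] for every i.
    // Offsets are assumed to be non-negative.
    static boolean matches(String text, char[] terms, int[] offsets) {
        for (int start = 0; start < text.length(); start++) {
            boolean ok = true;
            for (int i = 0; i < terms.length; i++) {
                int at = start + offsets[i];
                if (at >= text.length() || text.charAt(at) != terms[i]) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        char[] terms = {'天', '地'};
        int[] oneGap = {0, 2};   // like add(termA, 0) and add(termB, 2)
        int[] adjacent = {0, 1}; // like add(termA) followed by add(termB)

        System.out.println(matches("惊天动地", terms, oneGap)); // one character between the terms
        System.out.println(matches("天高地厚", terms, oneGap));
        System.out.println(matches("天地", terms, oneGap));     // no gap, so the gapped phrase fails
        System.out.println(matches("天地", terms, adjacent));
    }
}
```

The third call is the important one: with a fixed gap of one slot, the plain word “天地” no longer matches, because the positions must line up exactly.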

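PhraseQuery also provides the following method:

public void setSlop(int s) { slop = s; }

slop defaults to 0, which means the terms must occur in exactly the stated order with nothing between them. Setting it to 3 means a document still matches when up to 3 unrelated tokens (here, characters) sit between the terms, or more generally when the terms are at most 3 position moves away from the exact phrase. This differs from position in

public void add(Term term, int position)

where the gap is fixed: a document matches only if the terms sit at exactly the stated relative positions.

With slop, any gap of at most slop positions is accepted: 0, 1, ..., slop. It describes a range of gap lengths rather than one exact length.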
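The difference between a fixed gap and a tolerated range of gaps can be sketched with a toy model of slop for two terms that keep their order. The class `ToySlop` and its method `matches` are hypothetical, the text is again treated as one term per character, and the sketch covers only the in-order case; Lucene's slop is an edit distance over term positions, so it can also admit reordered terms (swapping two terms costs two moves).

```java
// Toy model of slop for a two-term phrase in which the terms keep their order.
// Illustration only: real sloppy matching also handles reordered terms.
class ToySlop {

    // True if `second` occurs after `first` with at most `slop` other characters between them.
    static boolean matches(String text, char first, char second, int slop) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != first) {
                continue;
            }
            // gap = number of characters strictly between the two terms
            for (int gap = 0; gap <= slop && i + 1 + gap < text.length(); gap++) {
                if (text.charAt(i + 1 + gap) == second) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "天高地厚";
        System.out.println(matches(text, '天', '地', 0));   // exact phrase required
        System.out.println(matches(text, '天', '地', 1));   // one extra character tolerated
        System.out.println(matches("天地", '天', '地', 3)); // gaps of 0..3 all qualify
    }
}
```

With slop 0 the text “天高地厚” is rejected, with slop 1 it is accepted, and the plain word “天地” is accepted at any slop, which is the range behaviour described above.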
