Lucene4 bm25 - jiangwen127 - ChinaUnix Blog

Lucene4 bm25

Category: Java

2013-06-15 11:01:19

Lucene 4.0 saves a number of commonly used index statistics per index segment and makes them available at search time.

To understand the new statistics, let's pretend we've indexed the following two example documents, each with only one field "title":

document 1: The Lion, the Witch, and the Wardrobe
document 2: The Da Vinci Code

Assume we tokenize on whitespace, commas are removed, all terms are downcased and we don't discard stop-words. Here are the statistics Lucene tracks:

    TermsEnum.docFreq(): How many documents contain at least one occurrence of the term in the field; 3.x indices also save this (TermEnum.docFreq()). For term "lion" docFreq is 1, and for term "the" it's 2.
    Terms.getSumDocFreq(): Number of postings, i.e. sum of TermsEnum.docFreq() across all terms in the field. For our example documents this is 9.
    TermsEnum.totalTermFreq(): Number of occurrences of this term in the field, across all documents. For term "the" it's 4, for term "vinci" it's 1.
    Terms.getSumTotalTermFreq(): Number of term occurrences in the field, across all documents; this is the sum of TermsEnum.totalTermFreq() across all unique terms in the field. For our example documents this is 11.
    Terms.getDocCount(): How many documents have at least one term for this field. In our example documents, this is 2, but if for example one of the documents was missing the title field, it would be 1.
    Terms.getUniqueTermCount(): How many unique terms were seen in this field. For our example documents this is 8. Note that this statistic is of limited utility for scoring, because it's only available per-segment and you cannot (efficiently!) compute this across all segments in the index (unless there is only one segment).
    Fields.getUniqueTermCount(): Number of unique terms across all fields; this is the sum of Terms.getUniqueTermCount() across all fields. In our example documents this is 8. Note that this is also only available per-segment.
    Fields.getUniqueFieldCount(): Number of unique fields. For our example documents this is 1; if we also had a body field and an abstract field, it would be 3. Note that this is also only available per-segment.

3.x indices only store TermsEnum.docFreq(), so if you want to experiment with the new scoring models in Lucene 4.0, you should either re-index or upgrade your index using IndexUpgrader. Note that the new scoring models all use the same single-byte norms format, so you can freely switch between them without re-indexing.
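
The numbers quoted for the two example titles can be checked by hand, or with a small program. The following is a minimal plain-Java sketch, not Lucene code: it does not call any Lucene API, and the class and method names (IndexStatsSketch, tokenize, termStats, sum, docCount) are hypothetical names chosen for illustration. It only mimics the tokenization assumed above and recomputes the per-field statistics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration class; it does not use or reproduce Lucene's implementation.
public class IndexStatsSketch {

    // Tokenize as the example assumes: drop commas, lowercase, split on whitespace, keep stop-words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.replace(",", "").toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Maps each term to {docFreq, totalTermFreq} for a single field.
    static Map<String, long[]> termStats(List<String> fieldValues) {
        Map<String, long[]> stats = new TreeMap<>();
        for (String value : fieldValues) {
            Set<String> seenInDoc = new HashSet<>();
            for (String term : tokenize(value)) {
                long[] s = stats.computeIfAbsent(term, k -> new long[2]);
                if (seenInDoc.add(term)) {
                    s[0]++; // first occurrence in this document counts toward docFreq
                }
                s[1]++; // every occurrence counts toward totalTermFreq
            }
        }
        return stats;
    }

    // Sums column 0 (docFreq) or column 1 (totalTermFreq) over all unique terms.
    static long sum(Map<String, long[]> stats, int column) {
        long total = 0;
        for (long[] s : stats.values()) {
            total += s[column];
        }
        return total;
    }

    // Counts documents that have at least one term in the field.
    static int docCount(List<String> fieldValues) {
        int count = 0;
        for (String value : fieldValues) {
            if (!tokenize(value).isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "The Lion, the Witch, and the Wardrobe",
                "The Da Vinci Code");
        Map<String, long[]> stats = termStats(titles);
        System.out.println("docFreq(the)       = " + stats.get("the")[0]);
        System.out.println("totalTermFreq(the) = " + stats.get("the")[1]);
        System.out.println("sumDocFreq         = " + sum(stats, 0));
        System.out.println("sumTotalTermFreq   = " + sum(stats, 1));
        System.out.println("docCount           = " + docCount(titles));
        System.out.println("uniqueTermCount    = " + stats.size());
    }
}
```

Replacing one of the titles with an empty string shows the getDocCount() case described above: the document count drops to 1 while the other title's terms are still counted.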

In addition to what's stored in the index, there are also these statistics available per-field, per-document while indexing, in the FieldInvertState passed to the Similarity.computeNorm method for both 3.x and 4.0:

    length: How many tokens in the document. For document 1 it's 7; for document 2 it's 4.
    uniqueTermCount: For this field in this document, how many unique terms are there? For document 1, it's 5; for document 2 it's 4.
    maxTermFrequency: What was the count for the most frequent term in this document. For document 1 it's 3 ("the" occurs 3 times); for document 2 it's 1.

In 3.x, if you want to consume these indexing-time statistics, you'll have to save them away yourself (e.g., somehow encoding them into the single-byte norm value). However, since 4.0 uses doc values for norms, you have more freedom to encode these statistics however you'd like. Your custom similarity can then pull from these.
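
As a rough illustration of the three per-document values, here is a plain-Java sketch that recomputes them for one field value. It is an assumption-laden stand-in, not the FieldInvertState class itself: PerDocumentStatsSketch and fieldStats are hypothetical names, and the tokenization simply follows the rules assumed for the example documents.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; it mirrors the three values, not Lucene's FieldInvertState.
public class PerDocumentStatsSketch {

    // Returns {length, uniqueTermCount, maxTermFrequency} for one field value of one document.
    static int[] fieldStats(String text) {
        Map<String, Integer> termFreqs = new HashMap<>();
        int length = 0;
        int maxTermFrequency = 0;
        // Same assumed tokenization: drop commas, lowercase, split on whitespace.
        for (String token : text.replace(",", "").toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            length++;
            int tf = termFreqs.merge(token, 1, Integer::sum); // updated count for this term
            maxTermFrequency = Math.max(maxTermFrequency, tf);
        }
        return new int[] {length, termFreqs.size(), maxTermFrequency};
    }

    public static void main(String[] args) {
        String[] titles = {"The Lion, the Witch, and the Wardrobe", "The Da Vinci Code"};
        for (String title : titles) {
            int[] s = fieldStats(title);
            System.out.println(title + ": length=" + s[0]
                    + " uniqueTermCount=" + s[1]
                    + " maxTermFrequency=" + s[2]);
        }
    }
}
```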

From these available statistics you're now free to derive other commonly used statistics:

Average field length across all documents is Terms.getSumTotalTermFreq() divided by maxDoc (or Terms.getDocCount(), if not all documents have the field).
Average within-document field term frequency is FieldInvertState.length divided by FieldInvertState.uniqueTermCount.
Average number of unique terms per field across all documents is Terms.getSumDocFreq() divided by maxDoc (or Terms.getDocCount(), if not all documents have the field).
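
Plugging the example numbers into the three ratios above gives a quick sanity check. This is a minimal sketch under stated assumptions: the method names are hypothetical, the inputs are passed in as plain numbers rather than read from a Lucene index, and both example documents have the title field, so the document count of 2 is used as the divisor.

```java
// Hypothetical illustration class; the inputs are the example values quoted in the text.
public class DerivedStatsSketch {

    // Terms.getSumTotalTermFreq() divided by the number of documents that have the field.
    static double averageFieldLength(long sumTotalTermFreq, int docCount) {
        return (double) sumTotalTermFreq / docCount;
    }

    // FieldInvertState.length divided by FieldInvertState.uniqueTermCount, for one document.
    static double averageTermFrequency(int length, int uniqueTermCount) {
        return (double) length / uniqueTermCount;
    }

    // Terms.getSumDocFreq() divided by the number of documents that have the field.
    static double averageUniqueTerms(long sumDocFreq, int docCount) {
        return (double) sumDocFreq / docCount;
    }

    public static void main(String[] args) {
        // Example values: sumTotalTermFreq = 11, sumDocFreq = 9, docCount = 2.
        System.out.println("average field length      = " + averageFieldLength(11, 2));
        // Document 1: length = 7, uniqueTermCount = 5.
        System.out.println("average term freq (doc 1) = " + averageTermFrequency(7, 5));
        System.out.println("average unique terms      = " + averageUniqueTerms(9, 2));
    }
}
```

For the example documents this works out to an average field length of 11 / 2 = 5.5, an average within-document term frequency of 7 / 5 = 1.4 for document 1, and 9 / 2 = 4.5 unique terms per document. A real caller would also need to handle a document count of zero, which this sketch does not.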

Remember that the statistics do not reflect deleted documents until those documents are merged away; in general this also means that segment merging will alter scores! Similarly, if the field omits term frequencies, then the statistics will not be correct (though they will still be consistent with one another: we will pretend each term occurred once per document).
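
To make the "once per document" remark concrete, the sketch below counts term occurrences for the example titles twice: once normally, and once pretending each term occurred a single time in each document. It is only an illustration of that sentence in plain Java, with hypothetical names (OmittedFrequenciesSketch, sumTotalTermFreq, omitTermFreqs); it does not show how Lucene itself stores or reports these values.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; it models the "once per document" pretence only.
public class OmittedFrequenciesSketch {

    // Counts term occurrences in a field; with omitTermFreqs, each term counts once per document.
    static long sumTotalTermFreq(List<String> fieldValues, boolean omitTermFreqs) {
        long total = 0;
        for (String value : fieldValues) {
            Set<String> seenInDoc = new HashSet<>();
            // Same assumed tokenization as the example: drop commas, lowercase, split on whitespace.
            for (String token : value.replace(",", "").toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                boolean firstInDoc = seenInDoc.add(token);
                if (!omitTermFreqs || firstInDoc) {
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "The Lion, the Witch, and the Wardrobe",
                "The Da Vinci Code");
        System.out.println("with term frequencies:    " + sumTotalTermFreq(titles, false));
        System.out.println("term frequencies omitted: " + sumTotalTermFreq(titles, true));
    }
}
```

With frequencies omitted, the occurrence count for the example falls from 11 to 9, which is the same as the sum of document frequencies. That is the sense in which the statistics stay consistent with one another even though they no longer match the text.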

http://blog.mikemccandless.com/2012/03/new-index-statistics-in-lucene-40.html
