2008-06-22 13:15:19

 
 

About the FieldInfos and FieldInfo classes.

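A FieldInfo corresponds to one Field of a Document (more precisely, to one distinct field name). FieldInfos is a container of FieldInfo objects: it manages the FieldInfo entries for all Fields of the Documents it is given.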

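The relationship between FieldInfos and FieldInfo mirrors the one between SegmentInfos (see the article Lucene-2.2.0 source code reading and study (18)) and SegmentInfo (see the article Lucene-2.2.0 source code reading and study (19)).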
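The container relationship described above can be pictured with a minimal sketch. It is not the Lucene code: FieldRegistrySketch, Entry, numberOf and nameOf are hypothetical names chosen for illustration, and the only idea carried over is that each distinct field name gets one entry, reachable both by name and by a sequential number.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for illustration only; not the Lucene classes.
public class FieldRegistrySketch {

    // One entry per distinct field name.
    static final class Entry {
        final String name;
        final int number;   // position of this entry in the by-number list

        Entry(String name, int number) {
            this.name = name;
            this.number = number;
        }
    }

    private final List<Entry> byNumber = new ArrayList<>();
    private final Map<String, Entry> byName = new HashMap<>();

    // Registers a name once; a repeated name returns the existing entry.
    Entry add(String name) {
        Entry e = byName.get(name);
        if (e == null) {
            e = new Entry(name, byNumber.size()); // next free number
            byNumber.add(e);
            byName.put(name, e);
        }
        return e;
    }

    // Returns the number for a name, or -1 if the name is unknown.
    int numberOf(String name) {
        Entry e = byName.get(name);
        return e == null ? -1 : e.number;
    }

    // Returns the name for a number, or "" if the number is out of range.
    String nameOf(int number) {
        return (number >= 0 && number < byNumber.size()) ? byNumber.get(number).name : "";
    }

    int size() {
        return byNumber.size();
    }

    public static void main(String[] args) {
        FieldRegistrySketch reg = new FieldRegistrySketch();
        reg.add("title");
        reg.add("body");
        reg.add("title"); // already known, no new number is assigned
        System.out.println(reg.numberOf("body"));   // 1
        System.out.println(reg.nameOf(0));          // title
        System.out.println(reg.numberOf("author")); // -1
        System.out.println(reg.size());             // 2
    }
}
```

Keeping both a list and a map costs a second reference per entry, but it makes lookups by number and by name equally cheap, which is the point of having the two collections side by side.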

The FieldInfo class is simple. Its definition is as follows:

package org.apache.lucene.index;

final class FieldInfo {
String name;    // name of the Field
boolean isIndexed;    // whether the Field is indexed
int number;    // number assigned to the Field

// whether term vectors are stored for the Field (and, if so, with offsets and/or positions)
boolean storeTermVector;
boolean storeOffsetWithTermVector;
boolean storePositionWithTermVector;

boolean omitNorms; // whether norms are omitted for this indexed Field
  
boolean storePayloads; // whether the Field stores payloads attached to term positions

// Constructs a FieldInfo object

FieldInfo(String na, boolean tk, int nu, boolean storeTermVector,
            boolean storePositionWithTermVector, boolean storeOffsetWithTermVector,
            boolean omitNorms, boolean storePayloads) {
    name = na;
    isIndexed = tk;
    number = nu;
    this.storeTermVector = storeTermVector;
    this.storeOffsetWithTermVector = storeOffsetWithTermVector;
    this.storePositionWithTermVector = storePositionWithTermVector;
    this.omitNorms = omitNorms;
    this.storePayloads = storePayloads;
}
}

That is the complete definition of the FieldInfo class in version 2.2.0.

Next comes the definition of FieldInfos. This class uses FieldInfo objects to manage all the Fields of a Document. The source code is as follows:

package org.apache.lucene.index;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

import java.io.IOException;
import java.util.*;

// A FieldInfo describes a Field of a Document; FieldInfos manages the individual FieldInfo objects.
final class FieldInfos {

// Bit masks (hexadecimal byte constants) used to pack the FieldInfo properties into a single byte
  
static final byte IS_INDEXED = 0x1;    // the field is indexed
static final byte STORE_TERMVECTOR = 0x2;    // term vectors are stored
static final byte STORE_POSITIONS_WITH_TERMVECTOR = 0x4;   // positions are stored with the term vector
static final byte STORE_OFFSET_WITH_TERMVECTOR = 0x8;   // offsets are stored with the term vector
static final byte OMIT_NORMS = 0x10;       // norms are omitted
static final byte STORE_PAYLOADS = 0x20;    // payloads are stored
  
private ArrayList byNumber = new ArrayList();       //   byNumber: list of FieldInfo objects, indexed by field number
private HashMap byName = new HashMap();   //   byName: map from field name to FieldInfo

FieldInfos() { }    //   no-argument constructor

//    Constructs a FieldInfos by reading the file called name from the index directory d
FieldInfos(Directory d, String name) throws IOException {
    IndexInput input = d.openInput(name);
    try {
      read(input);    //   the input stream is open; read the field infos from it
    } finally {
      input.close();
    }
}

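  //   Registers the Fields of a Document (this is not the same as adding a Field to a Document: what is recorded here is the per-field metadata, not the field content)
public void add(Document doc) {
    List fields = doc.getFields();    // get all Fields already added to the Document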
    Iterator fieldIterator = fields.iterator();
    while (fieldIterator.hasNext()) {
      Fieldable field = (Fieldable) fieldIterator.next();
      add(field.name(), field.isIndexed(), field.isTermVectorStored(), field.isStorePositionWithTermVector(),
              field.isStoreOffsetWithTermVector(), field.getOmitNorms());    // delegate to the overloaded add, which ends in the core add method
    }
}

/**
   * Adds indexed Fields; the caller specifies whether term vectors (with positions and/or offsets) are stored
   */

public void addIndexed(Collection names, boolean storeTermVectors, boolean storePositionWithTermVector,boolean storeOffsetWithTermVector) {
    Iterator i = names.iterator();
    while (i.hasNext()) {
      add((String)i.next(), true, storeTermVectors, storePositionWithTermVector, storeOffsetWithTermVector);
    }
}

/**
   * Adds Fields that do not store term vectors
   *
   * @param names The names of the fields
   * @param isIndexed Whether the fields are indexed or not
   *
   * @see #add(String, boolean)
   */

public void add(Collection names, boolean isIndexed) {
    Iterator i = names.iterator();
    while (i.hasNext()) {
      add((String)i.next(), isIndexed);
    }
}

/**
   * Calls 6 parameter add with false for all TermVector parameters and for omitNorms.
   *
   * @param name The name of the Fieldable
   * @param isIndexed true if the field is indexed
   * @see #add(String, boolean, boolean, boolean, boolean, boolean)
   */

public void add(String name, boolean isIndexed) {
    add(name, isIndexed, false, false, false, false);
}

/**
   * Calls 6 parameter add with false for term vector positions and offsets and for omitNorms.
   *
   * @param name The name of the field
   * @param isIndexed true if the field is indexed
   * @param storeTermVector true if the term vector should be stored
   */

public void add(String name, boolean isIndexed, boolean storeTermVector){
    add(name, isIndexed, storeTermVector, false, false, false);
}

/** If the field is not yet known, adds it. If it is known, checks to make
   * sure that the isIndexed flag is the same as was given previously for this
   * field. If not - marks it as being indexed. Same goes for the TermVector
   * parameters.
   *
   * @param name The name of the field
   * @param isIndexed true if the field is indexed
   * @param storeTermVector true if the term vector should be stored
   * @param storePositionWithTermVector true if the term vector with positions should be stored
   * @param storeOffsetWithTermVector true if the term vector with offsets should be stored
   */

public void add(String name, boolean isIndexed, boolean storeTermVector,
                  boolean storePositionWithTermVector, boolean storeOffsetWithTermVector) {

    add(name, isIndexed, storeTermVector, storePositionWithTermVector, storeOffsetWithTermVector, false);
}

    /** If the field is not yet known, adds it. If it is known, checks to make
   * sure that the isIndexed flag is the same as was given previously for this
   * field. If not - marks it as being indexed. Same goes for the TermVector
   * parameters.
   *
   * @param name The name of the field
   * @param isIndexed true if the field is indexed
   * @param storeTermVector true if the term vector should be stored
   * @param storePositionWithTermVector true if the term vector with positions should be stored
   * @param storeOffsetWithTermVector true if the term vector with offsets should be stored
   * @param omitNorms true if the norms for the indexed field should be omitted
   */

public void add(String name, boolean isIndexed, boolean storeTermVector,
                  boolean storePositionWithTermVector, boolean storeOffsetWithTermVector, boolean omitNorms) {
    add(name, isIndexed, storeTermVector, storePositionWithTermVector,
        storeOffsetWithTermVector, omitNorms, false);
}

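/** If the field is not yet known, adds it. If it is known, checks whether the isIndexed flag matches the one given previously; if not, marks the field as indexed. The same goes for the other flags.
   * This overload is the core implementation that all other add methods delegate to.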
   * @param name The name of the field
   * @param isIndexed true if the field is indexed
   * @param storeTermVector true if the term vector should be stored
   * @param storePositionWithTermVector true if the term vector with positions should be stored
   * @param storeOffsetWithTermVector true if the term vector with offsets should be stored
   * @param omitNorms true if the norms for the indexed field should be omitted
   * @param storePayloads true if payloads should be stored for this field
   */
public FieldInfo add(String name, boolean isIndexed, boolean storeTermVector,
                       boolean storePositionWithTermVector, boolean storeOffsetWithTermVector,
                       boolean omitNorms, boolean storePayloads) {
    FieldInfo fi = fieldInfo(name);    //   look up an existing FieldInfo by name
    if (fi == null) {    // not registered yet: create and register a new FieldInfo via addInternal()
      return addInternal(name, isIndexed, storeTermVector, storePositionWithTermVector, storeOffsetWithTermVector, omitNorms, storePayloads);
    } else {    // a FieldInfo with the same name already exists
      if (fi.isIndexed != isIndexed) {    // the stored flag and the new flag differ
        fi.isIndexed = true;                      // once indexed, always indexed
      }
      if (fi.storeTermVector != storeTermVector) {
        fi.storeTermVector = true;               // once vector, always vector
      }
      if (fi.storePositionWithTermVector != storePositionWithTermVector) {
        fi.storePositionWithTermVector = true;                // once vector, always vector
      }
      if (fi.storeOffsetWithTermVector != storeOffsetWithTermVector) {
        fi.storeOffsetWithTermVector = true;                // once vector, always vector
      }
      if (fi.omitNorms != omitNorms) {
        fi.omitNorms = false;                // once norms are stored, always store norms
      }
      if (fi.storePayloads != storePayloads) {
        fi.storePayloads = true;
      }

    }
    return fi;    // return the existing (possibly updated) FieldInfo
}

private FieldInfo addInternal(String name, boolean isIndexed,
                                boolean storeTermVector, boolean storePositionWithTermVector,
                                boolean storeOffsetWithTermVector, boolean omitNorms, boolean storePayloads) {
    FieldInfo fi =
      new FieldInfo(name, isIndexed, byNumber.size(), storeTermVector, storePositionWithTermVector,
              storeOffsetWithTermVector, omitNorms, storePayloads);
    byNumber.add(fi);    // byNumber is a List; append the new FieldInfo (its index equals its number)
    byName.put(name, fi);    // byName is a HashMap; the field name is the key and the FieldInfo reference is the value
    return fi;
}

public int fieldNumber(String fieldName) {    // returns the number of the Field with the given name, or -1 if unknown
    try {
      FieldInfo fi = fieldInfo(fieldName);
      if (fi != null)
        return fi.number;
    }
    catch (IndexOutOfBoundsException ioobe) {
      return -1;
    }
    return -1;
}

public FieldInfo fieldInfo(String fieldName) {    // looks up the FieldInfo for the given name in the byName map (null if absent)
    return (FieldInfo) byName.get(fieldName);
}

//   Returns the name of the Field with the given number
public String fieldName(int fieldNumber) {
    try {
      return fieldInfo(fieldNumber).name;
    }
    catch (NullPointerException npe) {
      return "";
    }
}

// Returns the FieldInfo for the given field number
public FieldInfo fieldInfo(int fieldNumber) {
    try {
      return (FieldInfo) byNumber.get(fieldNumber);    // fetch the FieldInfo at index fieldNumber from the byNumber list
    }
    catch (IndexOutOfBoundsException ioobe) {
      return null;
    }
}

public int size() {    // number of registered FieldInfo objects (the size of the byNumber list)
    return byNumber.size();
}

public boolean hasVectors() {    //    returns true if at least one registered Field stores term vectors, false otherwise
    boolean hasVectors = false;
    for (int i = 0; i < size(); i++) {
      if (fieldInfo(i).storeTermVector) {
        hasVectors = true;
        break;
      }
    }
    return hasVectors;
}

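public void write(Directory d, String name) throws IOException {    // writes the FieldInfo data to the index directory; name is the segment's field-infos file, i.e. the segment name plus the .fnm extension (see the article Lucene-2.2.0 source code reading and study (21), the addDocument() method of DocumentWriter)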
    IndexOutput output = d.createOutput(name);
    try {
      write(output);    // delegate to the write method below, which serializes the FieldInfo data to the output
    } finally {
      output.close();
    }
}

public void write(IndexOutput output) throws IOException { // serializes the FieldInfo data: the count, then for each field its name and one byte of flags
    output.writeVInt(size());
    for (int i = 0; i < size(); i++) {
      FieldInfo fi = fieldInfo(i);
      byte bits = 0x0;
      if (fi.isIndexed) bits |= IS_INDEXED;
      if (fi.storeTermVector) bits |= STORE_TERMVECTOR;
      if (fi.storePositionWithTermVector) bits |= STORE_POSITIONS_WITH_TERMVECTOR;
      if (fi.storeOffsetWithTermVector) bits |= STORE_OFFSET_WITH_TERMVECTOR;
      if (fi.omitNorms) bits |= OMIT_NORMS;
      if (fi.storePayloads) bits |= STORE_PAYLOADS;
      output.writeString(fi.name);
      output.writeByte(bits);
    }
}

private void read(IndexInput input) throws IOException { // reads the FieldInfo data back from an open input stream
    int size = input.readVInt();//read in the size
    for (int i = 0; i < size; i++) {
      String name = input.readString().intern();
      byte bits = input.readByte();
      boolean isIndexed = (bits & IS_INDEXED) != 0;
      boolean storeTermVector = (bits & STORE_TERMVECTOR) != 0;
      boolean storePositionsWithTermVector = (bits & STORE_POSITIONS_WITH_TERMVECTOR) != 0;
      boolean storeOffsetWithTermVector = (bits & STORE_OFFSET_WITH_TERMVECTOR) != 0;
      boolean omitNorms = (bits & OMIT_NORMS) != 0;
      boolean storePayloads = (bits & STORE_PAYLOADS) != 0;
     
      addInternal(name, isIndexed, storeTermVector, storePositionsWithTermVector, storeOffsetWithTermVector, omitNorms, storePayloads);    //   build the FieldInfo in memory via addInternal() and register it
    }   
}

}
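The write() and read() methods in the listing pack the six boolean properties of a field into one byte and unpack them again. The following is a small self-contained sketch of that bit-mask technique. The constant values are copied from the listing; the class FieldFlagsSketch and the helpers pack() and isSet() are hypothetical names introduced here, not part of Lucene.

```java
// Illustration of the bit-mask packing used by write()/read(); not Lucene code.
public class FieldFlagsSketch {

    // Same bit values as the constants in the FieldInfos listing.
    static final byte IS_INDEXED = 0x1;
    static final byte STORE_TERMVECTOR = 0x2;
    static final byte STORE_POSITIONS_WITH_TERMVECTOR = 0x4;
    static final byte STORE_OFFSET_WITH_TERMVECTOR = 0x8;
    static final byte OMIT_NORMS = 0x10;
    static final byte STORE_PAYLOADS = 0x20;

    // Packs six booleans into one byte, one bit per property.
    static byte pack(boolean isIndexed, boolean storeTermVector,
                     boolean storePositions, boolean storeOffsets,
                     boolean omitNorms, boolean storePayloads) {
        byte bits = 0x0;
        if (isIndexed) bits |= IS_INDEXED;
        if (storeTermVector) bits |= STORE_TERMVECTOR;
        if (storePositions) bits |= STORE_POSITIONS_WITH_TERMVECTOR;
        if (storeOffsets) bits |= STORE_OFFSET_WITH_TERMVECTOR;
        if (omitNorms) bits |= OMIT_NORMS;
        if (storePayloads) bits |= STORE_PAYLOADS;
        return bits;
    }

    // Tests a single property in a packed byte.
    static boolean isSet(byte bits, byte flag) {
        return (bits & flag) != 0;
    }

    public static void main(String[] args) {
        // indexed + term vector + omitNorms: 0x1 | 0x2 | 0x10 = 0x13
        byte bits = pack(true, true, false, false, true, false);
        System.out.println(bits);                        // 19
        System.out.println(isSet(bits, OMIT_NORMS));     // true
        System.out.println(isSet(bits, STORE_PAYLOADS)); // false
    }
}
```

Six flags fit comfortably in one byte, so each field costs only its name plus a single byte in the serialized form.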

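A summary of the FieldInfo and FieldInfos classes:

1. FieldInfo is an entity that holds the main properties of a Field.

2. Fields are operated on frequently, so keeping the lightweight but important FieldInfo objects in memory helps keep index building fast.

3. FieldInfos carries richer information: through the FieldInfo objects it brings the Fields of a Document into memory and manages each of them in detail, with lookup by name or by number.

4. FieldInfos can read the field information from the index directory on its own and write it back to the index directory; in memory it manages the Fields mainly on the basis of the Document passed to it.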
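One detail of the core add() method is worth making concrete: when a field name is seen again with different flags, the "positive" flags can only switch on, and omitNorms can only switch off. The comparisons in the listing (if the values differ, set to true, or to false for omitNorms) are logically equivalent to OR and AND. The sketch below shows that equivalence with a reduced set of flags; Flags and merge() are hypothetical names, not Lucene API.

```java
// Illustration of the flag-merging rule in the core add() method; not Lucene code.
public class FlagMergeSketch {

    // Reduced flag set, enough to show both merge directions.
    static final class Flags {
        boolean isIndexed;
        boolean storeTermVector;
        boolean omitNorms;

        Flags(boolean isIndexed, boolean storeTermVector, boolean omitNorms) {
            this.isIndexed = isIndexed;
            this.storeTermVector = storeTermVector;
            this.omitNorms = omitNorms;
        }
    }

    // Merges the flags of a repeated field into the existing entry.
    static void merge(Flags existing, boolean isIndexed,
                      boolean storeTermVector, boolean omitNorms) {
        existing.isIndexed |= isIndexed;             // once indexed, always indexed
        existing.storeTermVector |= storeTermVector; // once vector, always vector
        existing.omitNorms &= omitNorms;             // once norms are stored, always stored
    }

    public static void main(String[] args) {
        // First occurrence: not indexed, no term vector, norms omitted.
        Flags f = new Flags(false, false, true);
        // Second occurrence of the same name: indexed, norms wanted.
        merge(f, true, false, false);
        System.out.println(f.isIndexed);       // true
        System.out.println(f.storeTermVector); // false
        System.out.println(f.omitNorms);       // false
        // A later occurrence cannot switch the flags back.
        merge(f, false, false, true);
        System.out.println(f.isIndexed);       // true
        System.out.println(f.omitNorms);       // false
    }
}
```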

Overall summary:

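FieldInfos manages the Fields of Documents. It does so mainly in memory and then writes the result to the index directory. Specifically, everything it holds is written to the .fnm file of the index segment, named with the segment name plus the .fnm extension (see the article Lucene-2.2.0 source code reading and study (21), where this can be seen in the addDocument() method of the DocumentWriter class).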
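The layout that write(IndexOutput) produces, as far as the listing shows, is: the number of fields as a variable-length integer, then for each field its name followed by one flag byte. The sketch below round-trips that layout through an in-memory buffer. It is an approximation under stated assumptions: it uses java.io streams instead of Lucene's IndexOutput and IndexInput, writeUTF()/readUTF() stand in for writeString()/readString() (whose string encoding is not shown in the listing and may differ), and the VInt helpers implement the usual 7-bits-per-byte scheme with a continuation bit. FnmLayoutSketch, write() and describe() are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Approximate model of the field-infos layout; not the Lucene implementation.
public class FnmLayoutSketch {

    // Variable-length int: 7 data bits per byte, high bit set if more bytes follow.
    static void writeVInt(DataOutputStream out, int i) throws IOException {
        while ((i & ~0x7F) != 0) {
            out.writeByte((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.writeByte(i);
    }

    static int readVInt(DataInputStream in) throws IOException {
        byte b = in.readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    // Serializes: field count, then (name, flag byte) for each field.
    static byte[] write(String[] names, byte[] bits) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            writeVInt(out, names.length);
            for (int n = 0; n < names.length; n++) {
                out.writeUTF(names[n]);  // stand-in for IndexOutput.writeString
                out.writeByte(bits[n]);
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the layout back; the field number is simply the read order.
    static String describe(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int size = readVInt(in);
            StringBuilder sb = new StringBuilder();
            for (int n = 0; n < size; n++) {
                String name = in.readUTF();
                byte b = in.readByte();
                sb.append(n).append(':').append(name).append('=').append(b).append(';');
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = write(new String[] {"title", "body"}, new byte[] {0x1, 0x13});
        System.out.println(describe(data)); // 0:title=1;1:body=19;
    }
}
```

Note that field numbers are never written: they are implied by the order of the entries, which is why read() in the listing can rebuild them by calling addInternal() in sequence.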

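The DocumentWriter class is built on top of FieldInfos. FieldInfos manages the static information of all the Fields of a Document, while DocumentWriter offers more powerful Document handling: it performs higher-level operations on the Fields, such as tokenizing them with an Analyzer and sorting the resulting terms (inverting the document).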

The next step is to study the DocumentWriter class in detail.
