2008-06-20 19:48:09
CharTokenizer is an abstract class intended mainly for tokenizing Western-language text. In ordinary English, words are separated by spaces and punctuation, and tokenization splits the input at exactly those separator characters.
package org.apache.lucene.analysis;

import java.io.IOException;
import java.io.Reader;

// CharTokenizer is an abstract class
public abstract class CharTokenizer extends Tokenizer {

    public CharTokenizer(Reader input) {
        super(input);
    }

    private int offset = 0, bufferIndex = 0, dataLen = 0;
    private static final int MAX_WORD_LEN = 255;
    private static final int IO_BUFFER_SIZE = 1024;
    private final char[] buffer = new char[MAX_WORD_LEN];
    private final char[] ioBuffer = new char[IO_BUFFER_SIZE];

    // Decides whether a character belongs to a token; implemented by subclasses
    protected abstract boolean isTokenChar(char c);

    // Per-character processing hook; subclasses of CharTokenizer may override it
    protected char normalize(char c) {
        return c;
    }

    // The core method: returns the next token, or null at end of input
    public final Token next() throws IOException {
        int length = 0;
        int start = offset;
        while (true) {
            final char c;
            offset++;
            if (bufferIndex >= dataLen) {
                dataLen = input.read(ioBuffer);
                bufferIndex = 0;
            }
            if (dataLen == -1) {
                if (length > 0)
                    break;
                else
                    return null;
            } else
                c = ioBuffer[bufferIndex++];
            if (isTokenChar(c)) {              // if it's a token char
                if (length == 0)               // start of token
                    start = offset - 1;
                buffer[length++] = normalize(c); // buffer it, normalized
                if (length == MAX_WORD_LEN)    // buffer overflow!
                    break;
            } else if (length > 0)             // at non-Letter w/ chars
                break;                         // return 'em
        }
        return new Token(new String(buffer, 0, length), start, start + length);
    }
}
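The splitting logic in next() can be exercised without Lucene by porting it to a standalone method. The class and method names below are invented for illustration, and Token is replaced by a plain list of strings:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the splitting logic in CharTokenizer.next():
// accumulate token characters, emit a token at each non-token character.
public class CharSplitSketch {

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {    // plays the role of isTokenChar(c)
                buf.append(c);              // plays the role of buffer[length++] = normalize(c)
            } else if (buf.length() > 0) {  // a non-letter ends the current token
                tokens.add(buf.toString());
                buf.setLength(0);
            }
        }
        if (buf.length() > 0) {             // flush the last token at end of input
            tokens.add(buf.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("That's a world,I wonder why."));
        // [That, s, a, world, I, wonder, why]
    }
}
```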
CharTokenizer has three concrete subclasses: LetterTokenizer, RussianLetterTokenizer, and WhitespaceTokenizer.
Let us look first at LetterTokenizer; the other two are likewise built on CharTokenizer, whose core is the next() method:
package org.apache.lucene.analysis;

import java.io.Reader;

// Emits a token whenever a non-letter character is read
public class LetterTokenizer extends CharTokenizer {

    public LetterTokenizer(Reader in) {
        super(in);
    }

    protected boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }
}
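The normalize() hook, left as the identity in CharTokenizer, is what Lucene's LowerCaseTokenizer overrides: it extends LetterTokenizer and lowercases each buffered character. Its effect on the first token can be simulated without Lucene (the class and method names here are invented for the sketch):

```java
// Simulates LetterTokenizer's isTokenChar (Character.isLetter) combined
// with a LowerCaseTokenizer-style normalize (Character.toLowerCase),
// applied to the first token of a plain string.
public class NormalizeSketch {

    static String firstToken(String text) {
        StringBuilder buf = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                buf.append(Character.toLowerCase(c)); // normalize(c)
            } else if (buf.length() > 0) {
                break; // the first non-letter after a token ends it
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstToken("That's a world")); // that
    }
}
```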
A quick test of LetterTokenizer shows the behavior:
package org.shirdrn.lucene;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.LetterTokenizer;

public class LetterTokenizerTest {

    public static void main(String[] args) {
        Reader reader = new StringReader("That's a world,I wonder why.");
        LetterTokenizer ct = new LetterTokenizer(reader);
        try {
            System.out.println(ct.next());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The output is:
(That,0,4)
During tokenization, when the single quote is encountered, everything before it is returned as a token.
To verify this, change the Reader to the following form:
Reader reader = new StringReader("ThatisaworldIwonderwhy.");
The output becomes:
(ThatisaworldIwonderwhy,0,22)
A run of English letters containing no non-letter characters thus becomes a single token. A token's length is capped at 255 characters, as defined in the CharTokenizer abstract class:
private static final int MAX_WORD_LEN = 255;
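When a single run of letters exceeds this limit, the length == MAX_WORD_LEN branch breaks out of the loop, so the run comes back as successive 255-character tokens. The sketch below (names invented, no Lucene dependency) mirrors that chunking:

```java
import java.util.ArrayList;
import java.util.List;

// Splits an unbroken run of letters into chunks of at most MAX_WORD_LEN
// characters, mirroring the "buffer overflow!" break in CharTokenizer.next().
public class MaxWordLenSketch {

    static final int MAX_WORD_LEN = 255;

    static List<String> chunks(String letters) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < letters.length(); i += MAX_WORD_LEN) {
            out.add(letters.substring(i, Math.min(i + MAX_WORD_LEN, letters.length())));
        }
        return out;
    }

    public static void main(String[] args) {
        String run = "a".repeat(300);                    // 300 letters, no separator
        System.out.println(chunks(run).size());          // 2
        System.out.println(chunks(run).get(0).length()); // 255
        System.out.println(chunks(run).get(1).length()); // 45
    }
}
```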