Lucene插件

兼容Lucene4.x,请前往项目主页下载此插件。

标准分词器

也即不做长词切分的分词器。 调用方法:

String text = "中华人民共和国很辽阔";
for (int i = 0; i < text.length(); ++i)
{
    System.out.print(text.charAt(i) + "" + i + " ");
}
System.out.println();
Analyzer analyzer = new HanLPAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("field", text);
while (tokenStream.incrementToken())
{
    CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
    // 偏移量
    OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
    // 距离
    PositionIncrementAttribute positionAttr = tokenStream.getAttribute(PositionIncrementAttribute.class);
    System.out.println(attribute + " " + offsetAtt.startOffset() + " " + offsetAtt.endOffset() + " " + positionAttr.getPositionIncrement());
}

索引分词器

长词全切分的索引分词模式。 调用方法:

String text = "中华人民共和国很辽阔";
for (int i = 0; i < text.length(); ++i)
{
    System.out.print(text.charAt(i) + "" + i + " ");
}
System.out.println();
Analyzer analyzer = new HanLPIndexAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("field", text);
while (tokenStream.incrementToken())
{
    CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
    // 偏移量
    OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
    // 距离
    PositionIncrementAttribute positionAttr = tokenStream.getAttribute(PositionIncrementAttribute.class);
    System.out.println(attribute + " " + offsetAtt.startOffset() + " " + offsetAtt.endOffset() + " " + positionAttr.getPositionIncrement());
}

自定义分词器

HanLP的Lucene插件提供灵活的构造HanLPTokenizer方法:

String text = "中华人民共和国很辽阔";
for (int i = 0; i < text.length(); ++i)
{
    System.out.print(text.charAt(i) + "" + i + " ");
}
System.out.println();
StringReader reader = new StringReader(text);
HashSet<String> filter = new HashSet<String>();
filter.add("很");
Tokenizer tokenizer = new HanLPTokenizer(IndexTokenizer.SEGMENT.enableOffset(true), reader, filter, false);
while (tokenizer.incrementToken())
{
    CharTermAttribute attribute = tokenizer.getAttribute(CharTermAttribute.class);
    // 偏移量
    OffsetAttribute offsetAtt = tokenizer.getAttribute(OffsetAttribute.class);
    // 距离
    PositionIncrementAttribute positionAttr = tokenizer.getAttribute(PositionIncrementAttribute.class);
    System.out.println(attribute + " " + offsetAtt.startOffset() + " " + offsetAtt.endOffset() + " " + positionAttr.getPositionIncrement());
}