lucene-java-user mailing list archives

From Stefan Trcek <wzzelfz...@abas.de>
Subject NGramTokenizer stops working after about 1000 terms
Date Mon, 14 Dec 2009 14:39:34 GMT
Hello

For a source code (git repo) search engine I chose an ngram analyzer for 
substring search (something like "git blame").
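
For example, with a fixed gram size of 5, tokenizing the word "substring" 
should yield one term per 5-character window:

    subst, ubstr, bstri, strin, tring

i.e. an input of N characters should produce N - 4 terms.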

This worked fine, except that some strings were not found. I tracked it 
down to the analyzer: once the ngram analyzer has yielded about 1000 
terms it stops yielding more, apparently at most (1024 - ngram_length) 
terms. With StandardAnalyzer it works as expected.
Is this a bug or did I miss a limit?

Tested with lucene-2.9.1 and 3.0; this is the core routine I use:

// Imports used by the snippet below:
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Analyzer that emits fixed-length 5-grams (minGram == maxGram == 5).
public static class NGramAnalyzer5 extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new NGramTokenizer(reader, 5, 5);
    }
}

// Runs the analyzer over a string and collects every emitted term.
public static String[] analyzeString(Analyzer analyzer,
            String fieldName, String string) throws IOException {
    List<String> output = new ArrayList<String>();
    TokenStream tokenStream = analyzer.tokenStream(fieldName,
            new StringReader(string));
    TermAttribute termAtt = (TermAttribute) tokenStream.addAttribute(
            TermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        output.add(termAtt.term());
    }
    tokenStream.end();
    tokenStream.close();
    return output.toArray(new String[0]);
}
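
A minimal driver sketch along these lines shows the effect (the attached 
example reads its input from "in.txt" instead of generating a string; the 
field name "content" is arbitrary):

public static void main(String[] args) throws IOException {
    // Build an input well beyond 1024 characters.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 2000; i++) {
        sb.append((char) ('a' + (i % 26)));
    }
    String input = sb.toString();

    String[] terms = analyzeString(new NGramAnalyzer5(), "content", input);

    // Expected: input.length() - 4 = 1996 five-grams.
    // Observed: only about 1024 - 5 = 1019 terms come back.
    System.out.println("expected " + (input.length() - 4)
            + " terms, got " + terms.length);
}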

The complete example is attached. "in.txt" must be in the current 
directory (".") and contains plain ASCII.

Stefan
