lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: NGramTokenizer stops working after about 1000 terms
Date Mon, 04 Jan 2010 05:21:56 GMT
This actually rings a bell for me. Have a look at Lucene's JIRA; I think this was reported
as a bug once and has perhaps been fixed.


Note that Lucene in Action 2 has a case study that talks about searching source code.  You
may find that study interesting.
 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: Stefan Trcek <wzzelfzzel@abas.de>
> To: java-user@lucene.apache.org
> Sent: Mon, December 14, 2009 9:39:34 AM
> Subject: NGramTokenizer stops working after about 1000 terms
> 
> Hello
> 
> For a source code (git repo) search engine I chose to use an ngram 
> analyzer for substring search (something like "git blame").
> 
> This worked fine except that it didn't find some strings. I tracked it 
> down to the analyzer. Once the ngram analyzer had yielded about 1000 
> terms it stopped yielding more; the limit seems to be at most 
> (1024 - ngram_length) terms. When I use StandardAnalyzer it works as 
> expected. Is this a bug or did I miss a limit?
> 
> Tested with lucene-2.9.1 and 3.0, this is the core routine I use:
> 
> public static class NGramAnalyzer5 extends Analyzer {
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new NGramTokenizer(reader, 5, 5);
>     }
> }
> 
> public static String[] analyzeString(Analyzer analyzer,
>             String fieldName, String string) throws IOException {
>     List<String> output = new ArrayList<String>();
>     TokenStream tokenStream = analyzer.tokenStream(fieldName,
>             new StringReader(string));
>     TermAttribute termAtt = (TermAttribute)tokenStream.addAttribute(
>             TermAttribute.class);
>     tokenStream.reset();
>     while (tokenStream.incrementToken()) {
>         output.add(termAtt.term());
>     }
>     tokenStream.end();
>     tokenStream.close();
>     return output.toArray(new String[0]);
> }  
> 
> The complete example is attached. "in.txt" must be in "." and is plain 
> ASCII.
> 
> Stefan
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
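
One way to confirm the truncation described in the quoted message is to compare the tokenizer's output against a reference n-gram count computed without Lucene. The sketch below is only an illustration: the class name NGramCheck and its methods are hypothetical, and the idea that only the first 1024 characters get tokenized is an assumption drawn from the reported "about 1000 terms" limit, not a confirmed description of NGramTokenizer's internals.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Lucene): a reference n-gram generator
// to compare against the output of analyzeString() from the quoted message.
public class NGramCheck {

    // Returns every contiguous substring of length n, ordered by start offset.
    static List<String> ngrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    // Number of n-grams a complete tokenization of `length` chars should yield.
    static int expectedCount(int length, int n) {
        return Math.max(0, length - n + 1);
    }

    public static void main(String[] args) {
        // Build a 3000-character ASCII input, well past the suspected limit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 3000; i++) {
            sb.append((char) ('a' + i % 26));
        }
        String text = sb.toString();

        // Count over the whole input versus the first 1024 characters only
        // (the 1024 cutoff is an assumption based on the reported symptom).
        System.out.println("full input: " + ngrams(text, 5).size());
        System.out.println("first 1024: " + ngrams(text.substring(0, 1024), 5).size());
    }
}
```

If analyzeString() with NGramAnalyzer5 returns a count close to the second figure rather than the first for a long input, that would point to the tokenizer consuming only part of the reader rather than to a problem in the calling code.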

