lucene-java-user mailing list archives

From Ivan Krišto <ivan.kri...@gmail.com>
Subject Re: ngrams in Lucene 4.3.0
Date Mon, 15 Jul 2013 19:31:27 GMT
On 07/15/2013 07:50 PM, Malgorzata Urbanska wrote:
> Hi,
>
> I've been trying to figure out how to use ngrams in Lucene 4.3.0.
> I found some examples for earlier versions, but I'm still confused.
> As I understand it, I should:
> 1. create a new analyzer which uses ngrams
> 2. apply it to my indexer
> 3. search using the same analyzer
>
> I found NGramTokenFilter and NGramTokenizer in the documentation, but I
> do not understand the difference between them.
In short: NGramTokenizer is a Tokenizer, so it breaks the raw character
stream directly into n-grams, while NGramTokenFilter is a TokenFilter
that takes the tokens produced by some other tokenizer and breaks each
of them into n-grams. This should be helpful:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Tokenizers

Here is an example of an n-gram analyzer:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.util.Version;

public class NGramAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName,
            Reader reader) {
        // Break the raw character stream directly into 3-character grams.
        Tokenizer src = new NGramTokenizer(reader, 3, 3);

        // Lowercase the grams so matching is case-insensitive.
        TokenStream tok = new StandardFilter(Version.LUCENE_43, src);
        tok = new LowerCaseFilter(Version.LUCENE_43, tok);

        return new TokenStreamComponents(src, tok);
    }
}
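
To apply it to your indexer and then search with the same analyzer (your
steps 2 and 3), something along these lines should work. This is only a
rough, untested sketch; the RAMDirectory, the "content" field and the
sample text are just placeholders I made up:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class NGramDemo {
    public static void main(String[] args) throws IOException, ParseException {
        Analyzer analyzer = new NGramAnalyzer();
        Directory dir = new RAMDirectory();

        // Index a document with the n-gram analyzer.
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_43, analyzer);
        IndexWriter writer = new IndexWriter(dir, conf);
        Document doc = new Document();
        doc.add(new TextField("content", "Lucene in Action", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Search with the same analyzer so the query text is n-grammed
        // exactly like the indexed text.
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser(Version.LUCENE_43, "content", analyzer).parse("Act");
        TopDocs hits = searcher.search(query, 10);
        System.out.println("hits: " + hits.totalHits);
        reader.close();
    }
}

Because the query goes through the same analyzer, a partial word such as
"Act" is turned into the gram "act" and matches "Action" in the indexed text.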

If, for example, you want to remove stop words from the document before
breaking it into n-grams, then you would need a chain like:
reader(document) -> SomeTokenizer -> StopFilter -> NGramTokenFilter
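
A rough sketch of such an analyzer (again untested; I've used
StandardTokenizer as the "SomeTokenizer", StandardAnalyzer's English stop
set, and added a LowerCaseFilter before the StopFilter so the lowercase
stop words actually match):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class StopNGramAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName,
            Reader reader) {
        // Split the document into words first.
        Tokenizer src = new StandardTokenizer(Version.LUCENE_43, reader);

        // Lowercase, drop stop words, then break each remaining word into 3-grams.
        TokenStream tok = new LowerCaseFilter(Version.LUCENE_43, src);
        tok = new StopFilter(Version.LUCENE_43, tok,
                StandardAnalyzer.STOP_WORDS_SET);
        tok = new NGramTokenFilter(tok, 3, 3);

        return new TokenStreamComponents(src, tok);
    }
}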


  Regards,
    Ivan Krišto



