lucene-solr-user mailing list archives

From Kudrettin Güleryüz <kudret...@gmail.com>
Subject Re: NgramTokenizerFactory question
Date Mon, 02 Jul 2018 18:33:16 GMT
> 1) if you want face to match interface, you need max value to be at least
> 4.
Can you please explain this a bit more? I am not following this one. The
values are set to 3,3, and Solr already matches interface and interfaces when
I search for face. In addition, Solr also matches the individual trigrams of
face (fac and ace), which I find less relevant than interface or faceted.
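
To make the behaviour above concrete, here is a minimal plain-Java sketch. It does not use Lucene or Solr; the class name TrigramOrMatch and its methods are hypothetical, and the assumption is that the query is tokenized into trigrams and a document matches when it shares at least one of them:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TrigramOrMatch {
    // Emulates fixed-size n-gram tokenization (min = max = n) in plain Java.
    // Illustration only; this is not Lucene's NGramTokenizer.
    public static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // Assumed OR semantics: a document matches if it shares
    // at least one gram with the query.
    public static boolean matchesAny(String query, String doc, int n) {
        Set<String> docGrams = new HashSet<>(ngrams(doc, n));
        for (String gram : ngrams(query, n)) {
            if (docGrams.contains(gram)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("face", 3)); // the trigrams fac and ace
        for (String doc : new String[] {"interface", "faceted", "fac", "ace"}) {
            System.out.println(doc + " -> " + matchesAny("face", doc, 3));
        }
    }
}
```

Under that assumption all four documents match, including fac and ace, because each one contains at least one trigram of the query.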

The application I am moving to Solr 7.3.1 currently uses Lucene API 5.3.1
and has a custom analyzer like the following:


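import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

public class TrigramCaseAnalyzer extends SourceSearchAnalyzer {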
    private int indexType;

    public TrigramCaseAnalyzer() {
        indexType = 1;
    }

    @Override
    public int getIndexType() {
        return this.indexType;
    }

    @Override
    public void setIndexType(int type) {
        this.indexType = type;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Emit only 3-character grams (minGram = maxGram = 3).
        Tokenizer st = new NGramTokenizer(3, 3);
        return new TokenStreamComponents(st);
    }
}

This somehow behaves as I described: a search for face returns interface,
face and faceted, but not fac or ace.
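
One way to picture results like these is a match that requires all trigrams of the query to occur consecutively in the document, which is what a phrase-style query over the grams would do. The sketch below is plain Java with hypothetical names (TrigramPhraseMatch, matchesPhrase); it is only an illustration of that idea, not a statement of what Lucene 5.3.1 or Solr 7.3.1 actually does:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TrigramPhraseMatch {
    // Emulates fixed-size n-gram tokenization (min = max = n) in plain Java.
    // Illustration only; this is not Lucene's NGramTokenizer.
    public static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // Assumed phrase-like semantics: every gram of the query must appear
    // in the document's gram sequence, in order and without gaps.
    public static boolean matchesPhrase(String query, String doc, int n) {
        List<String> queryGrams = ngrams(query, n);
        if (queryGrams.isEmpty()) {
            return false; // query shorter than n produces no grams
        }
        return Collections.indexOfSubList(ngrams(doc, n), queryGrams) >= 0;
    }

    public static void main(String[] args) {
        for (String doc : new String[] {"interface", "face", "faceted", "fac", "ace"}) {
            System.out.println(doc + " -> " + matchesPhrase("face", doc, 3));
        }
    }
}
```

With this stricter rule, interface, face and faceted match, while fac and ace do not, because neither contains both fac and ace next to each other.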

Has this behaviour changed in Lucene since 5.3.1? Or is the difference in
behaviour caused by Solr's implementation of the Lucene API?

Thank you


On Mon, Jul 2, 2018 at 2:00 PM Alexandre Rafalovitch <arafalov@gmail.com>
wrote:

> Two things:
> 1) if you want face to match interface, you need max value to be at least
> 4.
> 2) you probably have the factory applied symmetrically (index and query) or
> on the Query analyzer. You probably want it on the Index analyzer side only.
> Otherwise you are trying to match any 3-letter query substring against your
> index.
>
> Admin UI analysis screen will show that to you.
>
> Regards,
>     Alex
>
> On Mon, Jul 2, 2018, 11:01 AM Kudrettin Güleryüz, <kudrettin@gmail.com>
> wrote:
>
> > Hi,
> >
> > When using NgramTokenizerFactory with settings min ngram size=3 and max
> > ngram size=3 I get the following behaviour.
> >
> > Assume that the search term is: face
> >
> > I expect the results to show documents with strings:
> > * interface or
> > * face or
> > * faceted
> >
> > but not
> > * ace or
> > * fac
> >
> > Why would I get the matches with results ace or fac? Am I missing some
> > settings somewhere? What is the suggested way to change this
> > behaviour?
> >
> > Thank you,
> >
>
