lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: FreeText Auto-suggest
Date Sun, 28 Jun 2015 10:48:52 GMT
Which documentation are you reading?

The analyzer you send to FreeTextSuggester should not make shingles
itself: the suggester does this internally, based on the grams
parameter.
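
As a rough illustration of that point, here is a conceptual sketch in Python. It is not Lucene's code: the helper names `shingles` and `build_dictionary` are made up for this example, and lowercased whitespace splitting stands in for a plain analyzer that emits single tokens only. It shows how shingling driven by a grams setting alone yields the entries listed in the quoted message below.

```python
def shingles(tokens, grams, sep=" "):
    """Return every run of 1..grams consecutive tokens, joined by sep."""
    out = []
    for n in range(1, grams + 1):
        for i in range(len(tokens) - n + 1):
            out.append(sep.join(tokens[i:i + n]))
    return out


def build_dictionary(values, grams):
    """Collect the distinct shingles over all field values."""
    entries = set()
    for value in values:
        # Stand-in for a plain analyzer: lowercase, split on whitespace,
        # and do NOT shingle here; shingling happens in shingles() above.
        tokens = value.lower().split()
        entries.update(shingles(tokens, grams))
    return entries


values = ["mp3 ipod", "mp3 player", "mp3 player ipod", "player of Real"]
dictionary = build_dictionary(values, grams=3)
for entry in sorted(dictionary, key=lambda e: (e.count(" "), e)):
    print(entry)
```

With `grams=3` the set contains the five unigrams, five bigrams and two trigrams from the quoted example; with `grams=1` only the unigrams remain.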

Maybe look at the TestFreeTextSuggester unit test as an example?

Mike McCandless

http://blog.mikemccandless.com


On Sat, Jun 27, 2015 at 6:52 PM, Alessandro Benedetti
<benedetti.alex85@gmail.com> wrote:
> Hi guys,
> after reading the documentation for the FreeTextSuggester I have some
> doubts:
>
> Actually the documentation is not clear enough.
> Let's try to understand this suggester.
>
> Building
> This suggester builds an FST that it uses to provide the autocomplete
> feature by running prefix searches on it.
> The terms it uses to generate the FST are the tokens produced by the
> "suggestFreeTextAnalyzerFieldType".
>
> And this should be correct.
> So, to keep it simple, if we have a shingle token filter [1-3] (we produce
> unigrams as well) in our analysis, from these original field values:
> "mp3 ipod"
> "mp3 player"
> "mp3 player ipod"
> "player of Real"
>
> -> we produce this list of possible suggestions in our FST:
>
> <mp3>
> <player>
> <ipod>
> <real>
> <of>
>
> <mp3 ipod>
> <mp3 player>
> <player ipod>
> <player of>
> <of real>
>
> <mp3 player ipod>
> <player of real>
>
> From the documentation I read:
>>
>> " ngrams: The max number of tokens out of which singles will be make the
>> dictionary. The default value is 2. Increasing this would mean you want more
>> than the previous 2 tokens to be taken into consideration when making the
>> suggestions. "
>
>
> This confuses me, as I was not expecting this param to affect the
> suggestion dictionary.
> So I would like a clarification here from our masters :)
> At this point let's see what happens at query time.
>
> Query Time
> As I understand it, the ngrams param considers the last N-1 tokens the
> user typed, separated by spaces.
>
>> "Builds an ngram model from the text sent to {@link #build} and
>> predicts based on the last grams-1 tokens in the request sent to
>> {@link #lookup}. This tries to handle the "long tail" of suggestions
>> for when the incoming query is a never before seen query string."
>
>
> Example: grams=3 should consider only the last 2 tokens
>
> special mp3 p -> mp3 p
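
The truncation in this example can be sketched as follows. This mirrors only the reading given in the quoted message, not Lucene's implementation: the function name `last_tokens` is hypothetical, whitespace splitting is an assumption, and whether the token still being typed counts toward the grams-1 tokens is not settled in this thread.

```python
def last_tokens(query, grams):
    """Keep only the last grams-1 whitespace-separated tokens of the query."""
    tokens = query.split()
    keep = grams - 1
    # max(...) guards both keep == 0 and queries shorter than keep tokens.
    start = max(len(tokens) - keep, 0)
    return " ".join(tokens[start:])


print(last_tokens("special mp3 p", grams=3))
```

Under that reading, `grams=2` would keep just the last token and `grams=1` would keep nothing.
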
>
> Then this query is analysed using the "suggestFreeTextAnalyzerFieldType".
> We produce 3 tokens:
> <mp3>
> <p>
> <mp3 p>
>
> And we run the prefix matching on the FST.
>
> Conclusion
> My understanding must be wrong at some point, as the behaviour I get is
> different.
> Can we discuss this, clarify it and eventually put it in the official
> documentation?
>
> Cheers
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
