lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Suggesters: circumfix suggestions
Date Wed, 16 Jan 2013 22:37:55 GMT
Netflix also does this; e.g. type "transla" (you need an account).

I think it'd be good to somehow support this (Lucene's suggesters don't today).

The first two approaches should conceptually work, but both will bloat
the FST (I'd be curious to know how much!).
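
To get a rough feel for that, here is a minimal sketch of the input expansion behind the first approach. It is plain Java with no Lucene classes, and SuffixExpansion / wordSuffixes are made-up names for illustration. It only shows how many entries each term turns into, not how well the FST would share them:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical helper, not part of Lucene.
final class SuffixExpansion {

    // Returns every word-level suffix of the term, longest first.
    // Each returned string would become its own entry in the suggester.
    static List<String> wordSuffixes(String term) {
        String[] words = term.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            // Join words[start..end) back into a single surface string.
            out.add(String.join(" ", Arrays.copyOfRange(words, start, words.length)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the full term followed by its three shorter suffixes.
        for (String s : wordSuffixes("boston red sox rumors")) {
            System.out.println(s);
        }
    }
}
```

A term with n words becomes n suggester entries, so the input grows linearly per term; how much the FST itself grows would still have to be measured.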

Maybe another approach would be ... to index only single tokens into
the suggester?  And then, from the user's query, run the suggester on
each token separately, and then do a second search (against a "normal"
Lucene index) to find all documents containing those tokens?

Eg, you'd index only "boston", "red", "sox", "rumor" into the FST, and
then have a separate search index with "boston red sox rumor" indexed
as a document.  If the user types "red so", then you run suggest on
"red" and on "so", and then run a (hmm) MultiPhraseQuery for
(red|redmond|reddit) (so|sox|sophomore|...) against the index?  How to
score/sort the resulting hits will be interesting ... if you have
strong priors / boost (e.g. you have a good source of "popularity" or
something) then you could sort by that ...
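
A rough sketch of that idea, with plain Java collections standing in for both pieces: a TreeSet plays the role of the single-token suggester, and a map of phrases plays the role of the search index. None of this is Lucene API; TokenSuggestSketch, add, complete and suggest are hypothetical names, and the popularity numbers are invented. The matching loop mimics what a MultiPhraseQuery would do (one set of alternatives per position, positions adjacent):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: in-memory stand-ins, not Lucene classes.
final class TokenSuggestSketch {
    // Stand-in for the FST: single tokens only, sorted for prefix lookup.
    private final TreeSet<String> tokens = new TreeSet<>();
    // Stand-in for the search index: full phrase -> assumed popularity.
    private final Map<String, Long> phrases = new LinkedHashMap<>();

    void add(String phrase, long popularity) {
        phrases.put(phrase, popularity);
        tokens.addAll(Arrays.asList(phrase.split("\\s+")));
    }

    // All indexed tokens that start with the given prefix.
    Set<String> complete(String prefix) {
        return tokens.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    // Completes each typed token separately, then keeps the phrases that
    // contain one alternative per position at adjacent positions.
    List<String> suggest(String typed) {
        List<Set<String>> alternatives = new ArrayList<>();
        for (String part : typed.trim().split("\\s+")) {
            alternatives.add(complete(part));
        }
        List<String> hits = new ArrayList<>();
        for (String phrase : phrases.keySet()) {
            if (matches(phrase.split("\\s+"), alternatives)) {
                hits.add(phrase);
            }
        }
        // Sort by the popularity prior, most popular first.
        hits.sort(Comparator.comparing((String p) -> phrases.get(p)).reversed());
        return hits;
    }

    private static boolean matches(String[] words, List<Set<String>> alternatives) {
        for (int start = 0; start + alternatives.size() <= words.length; start++) {
            boolean ok = true;
            for (int i = 0; i < alternatives.size() && ok; i++) {
                ok = alternatives.get(i).contains(words[start + i]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TokenSuggestSketch s = new TokenSuggestSketch();
        s.add("redmond sophomore league", 10);
        s.add("boston red sox rumor", 90);
        s.add("red carpet", 40);
        System.out.println(s.suggest("red so"));
    }
}
```

Here "red so" expands to (red|redmond) (sophomore|sox), so both the Red Sox phrase and the Redmond phrase match, and the popularity prior decides the order. In a real setup the phrase scan would of course be an index query rather than a loop over every phrase.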

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jan 16, 2013 at 4:27 PM, Oliver Christ <ochrist@ebscohost.com> wrote:
> Hi,
>
> Has anyone tried to implement circumfix suggesters, where the suggestion
> is a circumfix of the lookup string?
>
> E.g. "sox rumor" suggests "boston red sox rumors" (try it on
> google.com).
>
> I think there are several ways to implement this:
>
> * Given some multiword term, add all word subsequences to the
> suggester individually ("boston red sox rumors" adds also "red sox
> rumors", "sox rumors", "rumors") - that can be achieved using a special
> TermFreqIterator. This turns the lookup problem into a standard prefix
> search. While this works, it effectively modifies the surface form, and
> the "full term" needs to be indexed and looked up elsewhere.
>
> * Constructing a token graph with appropriate substring arcs
> from the (hopefully linear) token sequence, using a special TokenFilter.
> The benefit is that the surface form is always the same, but the
> automaton may become large (at least if you are using an
> AnalyzingSuggester).
>
> * DIY, using suffix arrays or something similar.
>
> But I'm sure there are other ways and/or tradeoffs I haven't thought
> about :) I'd be interested in your feedback.
>
> Cheers, Oli
