lucene-java-user mailing list archives

From "Oliver Christ" <ochr...@ebscohost.com>
Subject RE: Suggesters: circumfix suggestions
Date Thu, 17 Jan 2013 13:55:32 GMT
In our case (very similar to the "Netflix movie titles" use case) the
AnalyzingSuggester's FST grows by a factor of ~5 when we generate the
token graph.

Looking up and joining the "postings lists" for the individual tokens
would certainly work, but it takes more effort than injecting a
token graph generator into the index analyzer's token filter chain :)
(or modifying TokenStreamToAutomaton to generate the additional
transitions, but that may be too low-level).

Cheers, Oli

-----Original Message-----
From: Michael McCandless [mailto:lucene@mikemccandless.com] 
Sent: Wednesday, January 16, 2013 5:38 PM
To: java-user@lucene.apache.org
Subject: Re: Suggesters: circumfix suggestions

Netflix also does this, e.g. type "transla" (you need an account).

I think it'd be good to somehow support this (Lucene's suggesters don't
today).

The first two approaches should conceptually work, but both will bloat
the FST (I'd be curious to know how much!).

Maybe another approach would be ... to index only single tokens into the
suggester?  And then, from the user's query, run the suggester on each
token separately, and then do a second search (against a "normal"
Lucene index) to find all documents containing those tokens?

Eg, you'd index only "boston", "red", "sox", "rumor" into the FST, and
then have a separate search index with "boston red sox rumor" indexed as
a document.  If the user types "red so", then you run suggest on "red"
and on "so", and then run a hmm MultiPhraseQuery for
(red|redmond|reddit) (so|sox|sophomore|...) against the index?  How to
score/sort the resulting hits will be interesting ... if you have strong
priors / boost (e.g. you have a good source of "popularity" or
something) then you could sort by that ...
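
A rough sketch of this two-step lookup, assuming plain Java collections
stand in for both the single-token suggester and the "normal" index. No
Lucene API is used here, and the class and method names are hypothetical.
The adjacency check only imitates what a MultiPhraseQuery would do, and
scoring is left out entirely:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class TwoStepSuggestSketch {
    // Stand-in for the single-token suggester: a sorted set of all tokens.
    private final NavigableSet<String> tokens = new TreeSet<>();
    // Stand-in for the separate search index: each title as a token list.
    private final List<List<String>> docs = new ArrayList<>();

    void add(String title) {
        List<String> toks = Arrays.asList(title.toLowerCase().trim().split("\\s+"));
        docs.add(toks);
        tokens.addAll(toks);
    }

    // Step 1: prefix-complete one typed token, e.g. "so" -> [sophomore, sox].
    // Strings starting with the prefix sort in [prefix, prefix + '\uffff').
    List<String> complete(String prefix) {
        return new ArrayList<>(
            tokens.subSet(prefix, true, prefix + Character.MAX_VALUE, false));
    }

    // Step 2: keep titles that contain adjacent tokens, in order, where the
    // i-th token is one of the completions of the i-th typed token.
    List<String> suggest(String typed) {
        List<List<String>> candidates = new ArrayList<>();
        for (String part : typed.toLowerCase().trim().split("\\s+")) {
            candidates.add(complete(part));
        }
        List<String> hits = new ArrayList<>();
        for (List<String> doc : docs) {
            if (matchesPhrase(doc, candidates)) {
                hits.add(String.join(" ", doc));
            }
        }
        return hits;
    }

    private static boolean matchesPhrase(List<String> doc, List<List<String>> candidates) {
        for (int start = 0; start + candidates.size() <= doc.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < candidates.size() && ok; i++) {
                ok = candidates.get(i).contains(doc.get(start + i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TwoStepSuggestSketch s = new TwoStepSuggestSketch();
        s.add("boston red sox rumor");
        s.add("red carpet");
        System.out.println(s.suggest("red so"));
    }
}
```

The token set stays small because each distinct token is stored once; the
cost moves to the second lookup, which here is a linear scan and in
practice would be a query against the search index.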

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jan 16, 2013 at 4:27 PM, Oliver Christ <ochrist@ebscohost.com>
wrote:
> Hi,
>
>
>
> Has anyone tried to implement circumfix suggesters, where the 
> suggestion is a circumfix of the lookup string?
>
>
>
> E.g. "sox rumor" suggests "boston red sox rumors" (try it on 
> google.com).
>
>
>
> I think there are several ways to implement this:
>
>
>
> * Given some multiword term, add all word subsequences to the
> suggester individually ("boston red sox rumors" also adds "red sox
> rumors", "sox rumors", "rumors") - that can be achieved using a
> special TermFreqIterator. This turns the lookup problem into a
> standard prefix search. While this works, it effectively modifies the
> surface form, and the "full term" needs to be indexed and looked up
> elsewhere.
>
> * Constructing a token graph with appropriate substring arcs
> from the (hopefully linear) token sequence, using a special
> TokenFilter. The benefit is that the surface form is always the same,
> but the automaton may become large (at least if you are using an
> AnalyzingSuggester).
>
> * DIY, using suffix arrays or something similar.
>
>
>
> But I'm sure there are other ways and/or tradeoffs I haven't thought
> about :) I'd be interested in your feedback.
>
>
>
> Cheers, Oli
>
>
>
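
The first option in the quoted message (adding every word-level suffix of
a term as its own key) can be sketched roughly as follows. This is plain
Java with a TreeMap standing in for the suggester, not Lucene's
TermFreqIterator or FST, and the class and method names are hypothetical.
Each key maps back to the full term, which plays the role of the separate
lookup for the unmodified surface form:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

class SuffixExpansionSketch {
    // Lookup key (a word-level suffix) -> full terms it was derived from.
    private final TreeMap<String, List<String>> keys = new TreeMap<>();

    // "boston red sox rumors" -> the term itself, "red sox rumors",
    // "sox rumors", "rumors".
    static List<String> wordSuffixes(String term) {
        String[] words = term.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, words.length)));
        }
        return out;
    }

    void add(String term) {
        for (String suffix : wordSuffixes(term)) {
            keys.computeIfAbsent(suffix, k -> new ArrayList<>()).add(term);
        }
    }

    // Ordinary prefix search over the expanded keys; returns the full
    // terms, de-duplicated, in key order.
    List<String> lookup(String typed) {
        Set<String> hits = new LinkedHashSet<>();
        for (List<String> terms
                : keys.subMap(typed, true, typed + Character.MAX_VALUE, false).values()) {
            hits.addAll(terms);
        }
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        SuffixExpansionSketch s = new SuffixExpansionSketch();
        s.add("boston red sox rumors");
        System.out.println(s.lookup("sox rumor"));
    }
}
```

A term of n words contributes n keys, which gives a feel for why the
expanded structure grows compared with indexing each term once.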

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

