lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <>
Subject Re: Pre-filtering for expensive query
Date Wed, 03 Sep 2008 17:43:30 GMT
Op Wednesday 03 September 2008 18:06:57 schreef Matt Ronge:
> On Aug 30, 2008, at 3:01 PM, Paul Elschot wrote:
> > Op Saturday 30 August 2008 18:19:09 schreef Matt Ronge:
> >> On Aug 30, 2008, at 4:43 AM, Karl Wettin wrote:
> >>> Can you tell us a bit more about what you custom query does?
> >>> Perhaps you can build the "candidate filter" and reuse it over
> >>> and over again?
> >>
> >> I cannot reuse it. The candidate filter would be constructed by
> >> first running a boolean query with a number of SHOULD clauses. So
> >> then I know what docs atleast contain the terms I'm looking for.
> >> Once I have this set, I will look at the ordering of the matches
> >> (it's a bit more sophisticated than just a phrase query) and find
> >> the final matches.
> >
> > Sounds like you may want to take a look at SpanNearQuery.
> I'm going to take a second look at SpanNearQuery. I need it to
> support optional tokens, so I'm guessing I'll need to create a
> subclass to do that.

SpanNearQuery was not designed for optional tokens.
This can be tricky so make sure your specs are good. I know
only of this article for optional tokens and proximity:
Kunihiko Sadakane and Hiroshi Imai.  Fast algorithms for k -word 
proximity search. IEICE Trans. Fundamentals, E84-A(9), September 2001.

> >> Since my boolean clauses are different for each query I can't
> >> reuse the filter.
> >
> > With (a variation of) SpanNearQuery you may end up not needing
> > any filtering at all, because it already uses skipTo() where
> > possible.
> >
> > In case you are looking for documents that contain partial phrases
> > from an input query that has more than 2 words, have a look at
> > Nutch.
> I poked around in the Nutch docs and Javadocs, what should I look at
> in Nutch? What does it do exactly, is it the trick that Doug Cutting 
> mentioned where you concat neighboring terms together like "Hello
> world" becomes the token

That is an optimization for combinations of high frequency terms,
which is built into nutch iirc. But I don't know the details, so please
ask on a nutch list.

Paul Elschot

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message