lucene-java-user mailing list archives

From eks dev <>
Subject Re: A lot of short documents, optimal query?
Date Sat, 12 Nov 2005 19:21:14 GMT
Hi Hoss, 
Good to hear that, I felt a bit fuzzy trying to grasp
all the possibilities.

I've read the discussions about Doug's proposal for
implementing non-scoring Query features, about
ConstantScoreQuery, and about Paul's FilteredQuery patch.

In summary, the options to avoid scoring are:

1. There is a consensus that Doug's proposal would be
the best way to proceed, but it will take some time
until we get there.

2. Filters are perfect for what they do well:
filtering. But reimplementing BooleanQuery with them,
mirroring everything with filters, would introduce a
lot of redundancy. BooleanQuery, for example, does a
lot of cool optimizations such as short-circuiting
expressions, and more or less the same things would
have to be reimplemented using filters.

3. ConstantScoreQuery:
I am a bit unsure here, but this looks like a bridge
that enables Filters to enter the "regular" Query world.

4. Paul's patch on FilteredQuery. This is "just" an
optimization to avoid unnecessary scoring for docs
that do not pass the Filter (a rather smart one, though).
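
To make the short-circuiting mentioned in point 2 concrete, here is a
minimal sketch of a conjunction in which each side skips ahead to the
other side's candidate document. It is a toy model only: sorted int
arrays stand in for posting lists, and LeapfrogSketch, advance and
intersect are hypothetical names, not Lucene classes or methods.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: sorted doc-id arrays stand in for posting lists.
// All names here are illustrative assumptions, not Lucene API.
final class LeapfrogSketch {

    /** Smallest index i >= from with docs[i] >= target, or docs.length if none. */
    static int advance(int[] docs, int from, int target) {
        int lo = from, hi = docs.length;
        while (lo < hi) {                      // binary search over the remaining tail
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Intersects two sorted doc-id lists; the side that is behind jumps forward. */
    static List<Integer> intersect(int[] rare, int[] common) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < rare.length && j < common.length) {
            if (rare[i] == common[j]) {
                hits.add(rare[i]);
                i++;
                j++;
            } else if (rare[i] < common[j]) {
                i = advance(rare, i, common[j]);   // skip rare docs below the candidate
            } else {
                j = advance(common, j, rare[i]);   // skip common docs below the candidate
            }
        }
        return hits;
    }
}
```

When one clause matches only a few documents, most of the other list is
never visited. A conjunction built from precomputed bit sets has no such
hook and has to combine the full sets.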

So, back to my use case:

You are totally right, ZIP codes are done best with
SetFilter (or PrefixFilter), no doubt about this one.
And they were actually the problem, so the solution is
already here.

But now that I have learned about ConstantScoreQuery, I
started thinking about the following option:

The problem:
The first part of the query (the part about the field
"name" in my example) combines term queries in
many strange ways using BooleanQuery, so using
ChainedFilter would make things hard to read,
generalize and get right.

So, what would you say about the following:

1. Make a TermFilter for each unique, high-frequency
term in my query (I have frequency info during
construction of the query). Of course, simple
caching at the TermFilter level is really easy.

2. Wrap those TermFilters in ConstantScoreQuery.

3. Combine these inside a BooleanQuery as before (a
Boolean mix of term queries and ConstantScoreQueries).

The ZIPS field goes into SetFilter.
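
The caching idea in step 1 can be sketched roughly as follows. This is
a simplified model under stated assumptions, not Lucene code: a map from
term to sorted doc ids stands in for the index, java.util.BitSet stands
in for a filter's result, and CachedTermFilterSketch, bits and matchAll
are hypothetical names. It only models required (AND) terms restricted
by a ZIP bit set; wrapping in ConstantScoreQuery and mixing with scored
term queries is left out.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Simplified model of per-term filter caching; names are illustrative only.
final class CachedTermFilterSketch {
    private final Map<String, int[]> postings;              // term -> doc ids (stand-in for the index)
    private final Map<String, BitSet> cache = new HashMap<>();
    int cacheMisses = 0;                                     // counts how often a bit set was built

    CachedTermFilterSketch(Map<String, int[]> postings) {
        this.postings = postings;
    }

    /** Bit set of docs containing the term; built once per term, then reused. */
    BitSet bits(String term) {
        BitSet cached = cache.get(term);
        if (cached == null) {
            cacheMisses++;
            cached = new BitSet();
            for (int doc : postings.getOrDefault(term, new int[0])) {
                cached.set(doc);
            }
            cache.put(term, cached);
        }
        return cached;
    }

    /** Docs that contain every required term and also pass the ZIP bit set. */
    BitSet matchAll(BitSet zipDocs, String... requiredTerms) {
        BitSet result = (BitSet) zipDocs.clone();            // copy, so cached and caller sets stay intact
        for (String term : requiredTerms) {
            result.and(bits(term));
        }
        return result;
    }
}
```

Repeated searches that share high-frequency terms then pay for building
each term's bit set only once; the price is memory for one bit set per
cached term.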

Did I already say "thank you!" for staying with me
while I ask dumb questions? :) And yes, if you get
close to Hanover, a good German beer is on me for sure.

--- Chris Hostetter <> wrote:

> : Wouldn't it make sense to have a BooleanFilter,
> : TermFilter, MultiTermFilter, RangeFilter... family to
> : "mirror" the xxxQuery world with the same idioms and
> : interfaces? Is this the direction already taken in
> : Lucene development (an alternative would be to
> : parametrize the existing Query world)? As I see it
> : functionally, at the moment filters (and their
> : combination) are the only way to use a fast "pure
> : boolean" model.
> :
> : Does what I just said make any sense?
>
> It makes perfect sense, and you have grasped a lot of the
> possibilities.
>
> While making a Filter variant of every Query class is my gut
> instinct, there has in fact been discussion about generalizing
> Queries so that they can have a "non-scoring" mode. These issues
> have all been mentioned in LUCENE-383 ...
>
> One of the big reasons why it might make sense to use Queries
> instead of Filters, even if you don't care about scoring, is when
> you have a large set of very restrictive conditions (i.e. a
> BooleanQuery consisting of many TermQueries). Because of the
> flexibility of the Scorer API, the BooleanScorer can make good
> decisions to skip over large sets of documents, sometimes ignoring
> sub queries entirely, when one sub query only matches a few
> documents.
>
> The Filter API, on the other hand, doesn't have this flexibility.
> There is no way for a ChainedFilter/BooleanFilter to know that it
> can skip over one of its sub filters, or to ask one of its sub
> filters to only look at certain documents.
>
> I suggested the full Filter approach for your situation based on
> the following information...
>   1) you didn't care about scoring
>   2) you were using Range/Prefix queries on the ZIP field that
>      could easily exceed practical clause limits in BooleanQuery
>   3) your restrictions on the ZIP field looked like they could be
>      cached individually, so that the results could be reused
>      across many searches
> -Hoss