lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: auto-filters?
Date Thu, 06 Jan 2005 21:31:47 GMT
Paul Elschot wrote:
>>Filters are more efficient than query terms for many 
> 
> I think there are two reasons for the performance gain:
> - having things in RAM, eg. the bits of a filter after it is computed once,
> - being able to search per field instead of per document.

Also, bit-vectors are constant-time to access.
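
As a small illustration of the constant-time point, here is a minimal sketch using java.util.BitSet. The class and the helper bitsFor() are hypothetical names made up for this example and are not part of Lucene. Once the bits are computed, a membership test is one word lookup plus a mask, regardless of how many documents match.

```java
import java.util.BitSet;

// Hypothetical demo class, not part of Lucene.
class FilterBitsDemo {
    // Marks each matching document id in a bit vector sized to maxDoc.
    static BitSet bitsFor(int maxDoc, int[] matchingDocs) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : matchingDocs) {
            bits.set(doc);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = bitsFor(1000, new int[] {3, 42, 999});
        // Each get() is a constant-time lookup into the underlying long[].
        System.out.println(bits.get(42));
        System.out.println(bits.get(43));
    }
}
```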

> As a first impression I don't like using a boost value for this purpose.
> This would probably introduce problems for negative weights 
> and negative scores, even though these are currently not used.
> I'd rather keep the boosts and score values continuous and 
> without limits.

I've never been convinced that negative weights are useful.  Do you 
think that they are?

> Perhaps a better way to specify that some parts of a query have yes/no
> behaviour would be by designating a set of fields as 'pure boolean' or
> 'filtering', and pass this set to a query parser.
> Compared to the standard query parser, a query parser like that would
> only need to override some get...Query() methods on the basis
> of this set of fields.
> Typical 'filtering' fields are dates and primary keys.

As a design principle, Lucene has tried to avoid forcing folks to 
declare much about their documents and fields ahead of time.  Indexes 
with different fields indexed differently may be freely intermixed. 
Perhaps this is not worth preserving, but neither should we give it up 
lightly.

> In some cases it is possible to have better memory efficiency than one
> bit per document, see the compact sparse filter utilities I posted yesterday
> http://issues.apache.org/bugzilla/show_bug.cgi?id=32921
> I think this is most useful for reducing the filter cache size after various
> passes of collecting document id's on one or more BitSets.

This is great stuff!  Perhaps we should have a wrapper implementation 
that, when the bit-density is less than 1/8, uses this representation, 
and when the bit-density is greater than 1/8 converts to a bit vector?
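
A minimal sketch of such a wrapper, under stated assumptions: AdaptiveDocSet is a hypothetical name, and a sorted int[] stands in for the compact sparse representation from the Bugzilla attachment, which is not reproduced here. The 1/8 threshold is the one suggested above. What happens at exactly 1/8 is left open above; this sketch arbitrarily picks the bit vector in that case.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical wrapper, not part of Lucene. Picks a representation once,
// at construction time, based on the density of set bits.
class AdaptiveDocSet {
    private final BitSet dense;  // non-null when density >= 1/8
    private final int[] sparse;  // sorted doc ids, non-null when density < 1/8

    AdaptiveDocSet(BitSet bits, int maxDoc) {
        int count = bits.cardinality();
        // density < 1/8  is equivalent to  count * 8 < maxDoc
        if ((long) count * 8 < maxDoc) {
            int[] docs = new int[count];
            int i = 0;
            for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
                docs[i++] = doc;
            }
            sparse = docs;
            dense = null;
        } else {
            dense = (BitSet) bits.clone();
            sparse = null;
        }
    }

    boolean isSparse() {
        return sparse != null;
    }

    boolean contains(int doc) {
        if (dense != null) {
            return dense.get(doc);                    // constant time
        }
        return Arrays.binarySearch(sparse, doc) >= 0; // logarithmic in the match count
    }
}
```

The trade-off shows up in contains(): the sparse form saves memory at low density but gives up constant-time access.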

> I fully agree. BooleanScorer should first try and do all 'pure boolean'/ 
> 'filtering' work and then continue to determine the scores of the passing 
> documents.
> A possible design refinement:
> The 'pure boolean' queries could provide a PureBooleanScorer
> (subclass of Scorer) that throws an UnsupportedOperation exception for
> score(). These could then implement the Query.getFilterBits() operation above.

This API violates the "don't use exceptions for control flow" rule... 
Is your goal to get an efficient skipTo() for pure boolean queries?
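
If an efficient skipTo() is indeed the goal, one possible shape that avoids a throwing score() is an iterator type with no score() method at all. The sketch below is only an illustration: BitSetDocIterator is a hypothetical name, it does not extend Lucene's Scorer, and its method names merely mirror next(), doc() and skipTo().

```java
import java.util.BitSet;

// Hypothetical iterator over the set bits of a filter; not part of Lucene.
// There is no score() method, so nothing has to throw.
class BitSetDocIterator {
    private final BitSet bits;
    private int doc = -1;
    private boolean exhausted = false;

    BitSetDocIterator(BitSet bits) {
        this.bits = bits;
    }

    // Current document id; only meaningful after next() or skipTo() returned true.
    int doc() {
        return doc;
    }

    boolean next() {
        return skipTo(doc + 1);
    }

    // Advances to the first set bit at or after target, never moving backwards.
    boolean skipTo(int target) {
        if (exhausted) {
            return false;
        }
        int found = bits.nextSetBit(Math.max(target, doc + 1));
        if (found < 0) {
            exhausted = true;
            return false;
        }
        doc = found;
        return true;
    }
}
```

BitSet.nextSetBit() scans a word at a time, so skipping over long runs of non-matching documents is cheap.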

Doug
