Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Received-SPF: pass (hermes.apache.org: local policy)
From: Paul Elschot <paul.elschot@xs4all.nl>
To: lucene-dev@jakarta.apache.org
Subject: Re: auto-filters?
Date: Tue, 4 Jan 2005 13:43:07 +0100
User-Agent: KMail/1.5.4
References: <41D8C862.6060400@apache.org>
In-Reply-To: <41D8C862.6060400@apache.org>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200501041343.07386.paul.elschot@xs4all.nl>

I'm replying to the Doug's first message, and I'll try
and answer things brought forward later too.

On Monday 03 January 2005 05:21, Doug Cutting wrote:
> Filters are more efficient than query terms for many things. For
> example, a RangeFilter is usually more efficient than a RangeQuery and
> has no risk of triggering BooleanQuery.TooManyClauses. And Filter
> caching (e.g., with CachingWrapperFilter) can make otherwise expensive
> clauses almost free, after the first time.

I think there are two reasons for the peformance gain:
- having things in RAM, eg. the bits of a filter after it is computed once,
- being able to search per field instead of per document.
These two are related: since only one bit per document is
needed to determine the search result (eg. per field), the (interim)
result can be stored in RAM.
 
> But filters are not obvious. Many Lucene applications that would
> benefit from them do not. Wouldn't it be better if we could
> automatically spot Query clauses which are amenable to
> filter-conversion? Then applications would just get faster and throw
> fewer exceptions, without having to know anything about filters.
>
>  From a user level I think this might work as follows:
> 
> 1. Query clauses which have a boost of 0.0 are candidates for filter
> conversion, since they cannot contribute to the score.  We should
> perhaps make boost=0 the default for certain classes of query (e.g.,
> perhaps RangeQuery) or make subclasses with this as the default
> (KeywordQuery).

As a first impression I don't like using a boost value for this purpose.
This would probably introduce problems for negative weights 
and negative scores, even though these are currently not used.
I'd rather keep the boosts and score values continuous and 
without limits.

In the Scorer class, the next() and skipTo() methods define whether
a document is present in the search results, and the score() method
determines the score. Iirc there is already an exception to this for
'exclusive or' like behaviour implemented by a Similarity iirc.
I think Filters should be used for next() and skipTo() only, see below.

The Scorer class implements a mixed vector and boolean information
retrieval model, and Filters are boolean by their nature.

Perhaps a better way to specify that some parts of a query have yes/no
behaviour would be by designating a set of fields as 'pure boolean' or
'filtering', and pass this set to a query parser.
Compared to the standard query parser, a query parser like that would
only need to override some get...Query() methods on the basis
of this set of fields.
Typical 'filtering' fields are dates and primary keys.
 
> 2. One should be able to specify a filter cache size per IndexSearcher,
> with the notion that each filter cached uses one bit per document.
> 
> I'm not yet clear how this should be implemented.  It might be based on
> something like:
> 
>    public interface DocIdCollector {
>      void collectDocId(int docId);
>    }
> 
>    /** Collects all DocIds that match the query.  DocIds are collected
>        in no particular order and may be collected more than once.
>        Returns true if this feature is supported, false otherwise. */
>    public boolean Query.getFilterBits(IndexReader, DocIdCollector);
> 
> Implementing this for various query classes is straight-forward.
> TermQuery might return null for any but very common terms (occurring in,
> e.g., greater than 10% of documents).  RangeQuery would use the logic
> that's currently in RangeFilter.  Etc.
> 
> BooleanScorer could then use this method to create a filter bit-vector
> for all of the boost=0.0 clauses, then use that to filter the other
> boost!=0 clauses.  The bit vectors could be cached in the scorer (using
> a LinkedHashMap), although I'm a little fuzzy on exactly how the cache
> API would work.

In some cases it is possible to have better memory efficiency than one
bit per document, see the compact sparse filter utilities I posted yesterday
http://issues.apache.org/bugzilla/show_bug.cgi?id=32921
I think this is most useful for reducing the filter cache size after various
passes of collecting document id's on one or more BitSets.
 
> I'm not convinced the above is the best design, but I am convinced
> Lucene needs a solution for this.  It could automatically eliminate most
> causes of BooleanQuery.TooManyClauses (e.g., from date ranges), and also
> make many required keyword clauses (document type, language, etc.) much
> faster.
> 
> What do others think?  Does anyone have a better design or improvements
> to what I describe?

I fully agree. BooleanScorer should first try and do all 'pure boolean'/ 
'filtering' work and then continue to determine the scores of the passing 
documents.
A possible design refinement:
The 'pure boolean' queries could provide a PureBooleanScorer
(subclass of Scorer) that throws an UnsupportedOperation exception for
score(). These could then implement the Query.getFilterBits() operation above.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org