Return-Path: Delivered-To: apmail-jakarta-lucene-dev-archive@www.apache.org Received: (qmail 19449 invoked from network); 4 Jan 2005 12:43:26 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur-2.apache.org with SMTP; 4 Jan 2005 12:43:26 -0000 Received: (qmail 7789 invoked by uid 500); 4 Jan 2005 12:43:15 -0000 Delivered-To: apmail-jakarta-lucene-dev-archive@jakarta.apache.org Received: (qmail 7750 invoked by uid 500); 4 Jan 2005 12:43:15 -0000 Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm Precedence: bulk List-Unsubscribe: List-Subscribe: List-Help: List-Post: List-Id: "Lucene Developers List" Reply-To: "Lucene Developers List" Delivered-To: mailing list lucene-dev@jakarta.apache.org Received: (qmail 7719 invoked by uid 99); 4 Jan 2005 12:43:14 -0000 X-ASF-Spam-Status: No, hits=0.3 required=10.0 tests=FORGED_RCVD_HELO,RISK_FREE X-Spam-Check-By: apache.org Received-SPF: pass (hermes.apache.org: local policy) Received: from smtp-vbr10.xs4all.nl (HELO smtp-vbr10.xs4all.nl) (194.109.24.30) by apache.org (qpsmtpd/0.28) with ESMTP; Tue, 04 Jan 2005 04:43:10 -0800 Received: from k8l.lan (porta.xs4all.nl [80.127.24.69]) by smtp-vbr10.xs4all.nl (8.12.11/8.12.11) with ESMTP id j04Ch7bS032873 for ; Tue, 4 Jan 2005 13:43:07 +0100 (CET) (envelope-from paul.elschot@xs4all.nl) From: Paul Elschot To: lucene-dev@jakarta.apache.org Subject: Re: auto-filters? Date: Tue, 4 Jan 2005 13:43:07 +0100 User-Agent: KMail/1.5.4 References: <41D8C862.6060400@apache.org> In-Reply-To: <41D8C862.6060400@apache.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200501041343.07386.paul.elschot@xs4all.nl> X-Virus-Scanned: by XS4ALL Virus Scanner X-Virus-Checked: Checked X-Spam-Rating: minotaur-2.apache.org 1.6.2 0/1000/N I'm replying to the Doug's first message, and I'll try and answer things brought forward later too. On Monday 03 January 2005 05:21, Doug Cutting wrote: > Filters are more efficient than query terms for many things. For > example, a RangeFilter is usually more efficient than a RangeQuery and > has no risk of triggering BooleanQuery.TooManyClauses. And Filter > caching (e.g., with CachingWrapperFilter) can make otherwise expensive > clauses almost free, after the first time. I think there are two reasons for the peformance gain: - having things in RAM, eg. the bits of a filter after it is computed once, - being able to search per field instead of per document. These two are related: since only one bit per document is needed to determine the search result (eg. per field), the (interim) result can be stored in RAM. > But filters are not obvious. Many Lucene applications that would > benefit from them do not. Wouldn't it be better if we could > automatically spot Query clauses which are amenable to > filter-conversion? Then applications would just get faster and throw > fewer exceptions, without having to know anything about filters. > > From a user level I think this might work as follows: > > 1. Query clauses which have a boost of 0.0 are candidates for filter > conversion, since they cannot contribute to the score. We should > perhaps make boost=0 the default for certain classes of query (e.g., > perhaps RangeQuery) or make subclasses with this as the default > (KeywordQuery). As a first impression I don't like using a boost value for this purpose. This would probably introduce problems for negative weights and negative scores, even though these are currently not used. I'd rather keep the boosts and score values continuous and without limits. In the Scorer class, the next() and skipTo() methods define whether a document is present in the search results, and the score() method determines the score. Iirc there is already an exception to this for 'exclusive or' like behaviour implemented by a Similarity iirc. I think Filters should be used for next() and skipTo() only, see below. The Scorer class implements a mixed vector and boolean information retrieval model, and Filters are boolean by their nature. Perhaps a better way to specify that some parts of a query have yes/no behaviour would be by designating a set of fields as 'pure boolean' or 'filtering', and pass this set to a query parser. Compared to the standard query parser, a query parser like that would only need to override some get...Query() methods on the basis of this set of fields. Typical 'filtering' fields are dates and primary keys. > 2. One should be able to specify a filter cache size per IndexSearcher, > with the notion that each filter cached uses one bit per document. > > I'm not yet clear how this should be implemented. It might be based on > something like: > > public interface DocIdCollector { > void collectDocId(int docId); > } > > /** Collects all DocIds that match the query. DocIds are collected > in no particular order and may be collected more than once. > Returns true if this feature is supported, false otherwise. */ > public boolean Query.getFilterBits(IndexReader, DocIdCollector); > > Implementing this for various query classes is straight-forward. > TermQuery might return null for any but very common terms (occurring in, > e.g., greater than 10% of documents). RangeQuery would use the logic > that's currently in RangeFilter. Etc. > > BooleanScorer could then use this method to create a filter bit-vector > for all of the boost=0.0 clauses, then use that to filter the other > boost!=0 clauses. The bit vectors could be cached in the scorer (using > a LinkedHashMap), although I'm a little fuzzy on exactly how the cache > API would work. In some cases it is possible to have better memory efficiency than one bit per document, see the compact sparse filter utilities I posted yesterday http://issues.apache.org/bugzilla/show_bug.cgi?id=32921 I think this is most useful for reducing the filter cache size after various passes of collecting document id's on one or more BitSets. > I'm not convinced the above is the best design, but I am convinced > Lucene needs a solution for this. It could automatically eliminate most > causes of BooleanQuery.TooManyClauses (e.g., from date ranges), and also > make many required keyword clauses (document type, language, etc.) much > faster. > > What do others think? Does anyone have a better design or improvements > to what I describe? I fully agree. BooleanScorer should first try and do all 'pure boolean'/ 'filtering' work and then continue to determine the scores of the passing documents. A possible design refinement: The 'pure boolean' queries could provide a PureBooleanScorer (subclass of Scorer) that throws an UnsupportedOperation exception for score(). These could then implement the Query.getFilterBits() operation above. Regards, Paul Elschot --------------------------------------------------------------------- To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org For additional commands, e-mail: lucene-dev-help@jakarta.apache.org