lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: BooleanFilter MUST clauses and getDocIdSet(acceptDocs)
Date Thu, 08 Nov 2012 10:04:23 GMT
I thought about this further:

Maybe BooleanFilter should have two modes (selected by a boolean constructor argument): passing
the acceptDocs down to every filter (for the case where the filter calculation is expensive and
the acceptDocs help to limit it), or not passing them down (for the case where the filter is
cheap, e.g. only a cached bitset, and the repeated acceptDocs bit checks in every single filter
would cost more than they save). The first mode would also optimize the MUST/MUST_NOT case by
passing the further-restricted acceptDocs down to later filters (just like FilteredQuery
does).
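
A minimal sketch of the two modes, modeled with java.util.BitSet instead of the Lucene API so
that it runs on its own. The class TwoModeSketch, its filter method, and the use of IntPredicate
as a stand-in for a MUST clause are all hypothetical; this only illustrates the idea and is not
how BooleanFilter is implemented.

```java
import java.util.BitSet;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical model: each MUST clause is an IntPredicate over doc ids.
final class TwoModeSketch {

    static BitSet filter(int maxDoc, List<IntPredicate> mustClauses,
                         BitSet acceptDocs, boolean passAcceptDocs) {
        // Start with every document in the (single, assumed) segment.
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc);

        // Mode 1: restrict up front, so clauses only see accepted docs.
        if (passAcceptDocs && acceptDocs != null) {
            result.and(acceptDocs);
        }

        for (IntPredicate clause : mustClauses) {
            BitSet matches = new BitSet(maxDoc);
            if (passAcceptDocs) {
                // The clause only evaluates docs that survived acceptDocs
                // and all earlier MUST clauses.
                for (int doc = result.nextSetBit(0); doc >= 0;
                     doc = result.nextSetBit(doc + 1)) {
                    if (clause.test(doc)) {
                        matches.set(doc);
                    }
                }
            } else {
                // Mode 2: the clause builds its full bitset without any
                // per-document acceptDocs check.
                for (int doc = 0; doc < maxDoc; doc++) {
                    if (clause.test(doc)) {
                        matches.set(doc);
                    }
                }
            }
            result.and(matches);
        }

        // Mode 2: apply acceptDocs once, at the very end.
        if (!passAcceptDocs && acceptDocs != null) {
            result.and(acceptDocs);
        }
        return result;
    }
}
```

Both modes return the same documents; they differ only in how many documents each clause has to
evaluate, which is the trade-off described above.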

 

I will open an issue for that; it is a good idea.

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: Uwe Schindler [mailto:uwe@thetaphi.de] 
Sent: Thursday, November 08, 2012 8:29 AM
To: dev@lucene.apache.org
Subject: RE: BooleanFilter MUST clauses and getDocIdSet(acceptDocs)

 

Hi David,

 

passing the already built bits for the MUST clauses is a good idea and can be implemented
easily.

 

The acceptDocs were not passed down because of the new way filters work in Lucene 4.0, and
to optimize caching. The acceptDocs are the only thing that changes when deletions are applied,
and filters are required to handle them separately: whenever something is able to cache (e.g.
CachingWrapperFilter), the acceptDocs are not cached, so the underlying filters get a null
acceptDocs to produce the full bitset, and the filtering is done when CachingWrapperFilter
gets the up-to-date acceptDocs. For this case that does not matter, though: the first filter
clause does not get acceptDocs, but later MUST clauses can of course get them (they are not
deletion-specific)!
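
The caching behaviour described above can be sketched as follows. This is a model built on
java.util.BitSet, not the actual CachingWrapperFilter code: the class CachingSketch, its
docIdSet method, and the computations counter are hypothetical, and a single segment with a
fixed maxDoc is assumed.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Hypothetical model of a caching wrapper around one filter.
final class CachingSketch {
    private final IntPredicate wrapped; // stand-in for the wrapped filter
    private BitSet cached;              // full, deletion-independent result
    int computations = 0;               // how often the wrapped filter ran

    CachingSketch(IntPredicate wrapped) {
        this.wrapped = wrapped;
    }

    BitSet docIdSet(int maxDoc, BitSet acceptDocs) {
        if (cached == null) {
            // The wrapped filter is evaluated as if acceptDocs were null,
            // so the cached bitset stays valid when deletions change.
            computations++;
            cached = new BitSet(maxDoc);
            for (int doc = 0; doc < maxDoc; doc++) {
                if (wrapped.test(doc)) {
                    cached.set(doc);
                }
            }
        }
        // The current acceptDocs are applied on every call, outside the cache.
        BitSet result = (BitSet) cached.clone();
        if (acceptDocs != null) {
            result.and(acceptDocs);
        }
        return result;
    }
}
```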

 

Can you open an issue to optimize the MUST case (and possibly MUST_NOT, too)?

 

Another thing that could help here: you can stop using BooleanFilter if you can apply the
filters sequentially (only MUST clauses) by wrapping with multiple FilteredQuery instances: new
FilteredQuery(new FilteredQuery(originalQuery, clause1), clause2). If the DocIdSets support bits()
and the FilteredQuery auto-detection decides to use random-access filters, the acceptDocs are
also passed down from the outer query to the inner one, removing the documents already filtered
out.

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: david.w.smiley@gmail.com [mailto:david.w.smiley@gmail.com] 
Sent: Wednesday, November 07, 2012 8:23 PM
To: dev@lucene.apache.org
Subject: BooleanFilter MUST clauses and getDocIdSet(acceptDocs)

 

I am about to write a Filter that only operates on a set of documents that have already passed
other filter(s).  It's rather expensive, since it has to use DocValues to examine a value
and then determine if it's a match.  So it scales O(n), where n is the number of documents it
must see.  The 2nd arg of getDocIdSet is Bits acceptDocs.  Unfortunately Bits doesn't have
an int iterator, but I can deal with that by checking whether it extends DocIdSet.
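
A rough sketch of such a filter when only random access is available. The nested Bits interface
mirrors the two methods of Lucene's Bits (get and length); everything else here (the class
AcceptDocsScanSketch, the of helper, matchAccepted, and the IntPredicate standing in for the
DocValues check) is hypothetical and only illustrates the O(n) scan.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

final class AcceptDocsScanSketch {

    /** Minimal mirror of Lucene's Bits: random access only, no iterator. */
    interface Bits {
        boolean get(int index);
        int length();
    }

    /** Hypothetical helper that exposes a BitSet through the Bits view. */
    static Bits of(BitSet bits, int length) {
        return new Bits() {
            @Override public boolean get(int index) { return bits.get(index); }
            @Override public int length() { return length; }
        };
    }

    /** Runs the expensive per-document check only on accepted documents. */
    static BitSet matchAccepted(Bits acceptDocs, int maxDoc,
                                IntPredicate expensiveCheck) {
        BitSet result = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            // Cheap bit test first; null acceptDocs means "accept everything".
            if (acceptDocs != null && !acceptDocs.get(doc)) {
                continue;
            }
            if (expensiveCheck.test(doc)) {
                result.set(doc);
            }
        }
        return result;
    }
}
```

Without an iterator the loop still touches every doc id, but the expensive check runs only for
documents that acceptDocs lets through.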

 

I'm looking at BooleanFilter which I want to use and I notice that it passes null to filter.getDocIdSet
for acceptDocs, and it justifies this with the following comment:

// we dont pass acceptDocs, we will filter at the end using an additional filter

Uwe wrote this comment in relation to LUCENE-1536 (r1188624).

For the MUST clause loop, couldn't it give it the accumulated bits of the MUST clauses?  

 

~ David

