lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: [jira] Created: (LUCENE-1533) Deleted documents as a Filter or top level Query
Date Wed, 04 Feb 2009 00:14:59 GMT

eks dev wrote:

>> I agree we should warn about this in the javadocs... can you work  
>> up a patch?
> I'll give it a try, no promise when, changing job, moving...
>> I plan to run some tests to figure out the performance tradeoffs  
>> here.
>> We switched to iterator access for a toplevel filter, as of  
>> LUCENE-584, but from
>> LUCENE-1476 it's looking like except for fairly sparse filters,  
>> random access is
>> much faster.
> have a look at
> this should be important for deletions case


> I'll just keep dumping my thinking about it, maybe something  
> meaningful comes out, unfortunately not enough time to think deeper  
> or try it now..

No problem; keep it coming!

> as long as we look at single term queries, high deletion density  
> cases should be faster with random access (or anything else) at  
> TermDocs level because we will be just propagating decision higher  
> up instead of "killing" document at TermDocs level. Cases with more  
> terms, disjunctions, are getting interesting, starting to feel speed- 
> up proportional to the number of intersectiong documents.
> for Query (A OR B) we need to check if(deleted) condition  #A + #B  
> times if we do it at TermDocs level, in filter case we need to do it  
> only
> #(A\B) + #(B\A) + #(A AND B) and this number is smaller or equal  
> (worst case) than  #A + #B
> this is exactly the case that makes performance headaches.

Right!  This is why I'm going to test the full matrix (different  
queries X different sparseness filters).

> We have two competing issues, constant time factor on skipTo() vs  
> get() and algorithmic enhancement due to saved checks. Balance  
> depends on Query and skipTo()/get() performance diff.
> Maybe thinking along the "Filter with both options" lines, random  
> (optional support) and iterator?

Right: I'm guessing we eventually need a simplistic query optimizer  
that would choose how to apply a top-level AND'd filtered (or AND NOT,  
eg for deletions).  If the filter is very sparse, it's probably best  
to use iterator especially if filter is already iterator-friendly, eg  

> At the end of a day, Filter works at API level with DocIdSet, not  
> DocIdSetIterator.... that would remove constant factor, the question  
> is this possible to add optional DocIdSet.get(int ) on current API  
> and use it for some specialized cases like this one.

Either add that random-access API, or make a RandomAccessDocIdSet,  
or... something.  Not sure yet.  Ideally BooleanQuery (which we should  
consolidate deletions & top-level filters under) should somehow ask  
the filter if it's random access, and then drive the matching  

> also, math for conjunctions looks much better in filters

I'll test conjunctions too.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message