Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
Message-Id: <9F8A5EF7-5385-43EE-AE6E-182BFD57483C@mikemccandless.com>
From: Michael McCandless <lucene@mikemccandless.com>
To: java-dev@lucene.apache.org
In-Reply-To: <214130.38514.qm@web27104.mail.ukl.yahoo.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v930.3)
Subject: Re: [jira] Created: (LUCENE-1533) Deleted documents as a Filter or
 top level Query
Date: Tue, 3 Feb 2009 19:14:59 -0500
References: <1586222813.1233361859583.JavaMail.jira@brutus>
 <39215.18156.qm@web27102.mail.ukl.yahoo.com> <498464E8.40700@gmail.com>
 <2B43E660-5A05-4A7F-A966-6C242DD4924D@mikemccandless.com>
 <387343.66168.qm@web27104.mail.ukl.yahoo.com>
 <82A09A1F-C049-4A76-A772-1C2FE3554C9E@mikemccandless.com>
 <214130.38514.qm@web27104.mail.ukl.yahoo.com>


eks dev wrote:

>> I agree we should warn about this in the javadocs... can you work  
>> up a patch?
>
> I'll give it a try, no promise when, changing job, moving...
>
>> I plan to run some tests to figure out the performance tradeoffs  
>> here.
>>
>> We switched to iterator access for a toplevel filter, as of  
>> LUCENE-584, but from
>> LUCENE-1476 it's looking like except for fairly sparse filters,  
>> random access is
>> much faster.
>
> have a look at  https://issues.apache.org/jira/browse/LUCENE-1436
> this should be important for deletions case

OK.

> I'll just keep dumping my thinking about it, maybe something  
> meaningful comes out, unfortunately not enough time to think deeper  
> or try it now..

No problem; keep it coming!

> as long as we look at single term queries, high deletion density  
> cases should be faster with random access (or anything else) at  
> TermDocs level because we will be just propagating decision higher  
> up instead of "killing" document at TermDocs level. Cases with more  
> terms, disjunctions, are getting interesting, starting to feel speed- 
> up proportional to the number of intersectiong documents.
>
> for Query (A OR B) we need to check if(deleted) condition  #A + #B  
> times if we do it at TermDocs level, in filter case we need to do it  
> only
>
> #(A\B) + #(B\A) + #(A AND B) and this number is smaller or equal  
> (worst case) than  #A + #B
>
> this is exactly the case that makes performance headaches.

Right!  This is why I'm going to test the full matrix (different  
queries X different sparseness filters).

> We have two competing issues, constant time factor on skipTo() vs  
> get() and algorithmic enhancement due to saved checks. Balance  
> depends on Query and skipTo()/get() performance diff.
>
> Maybe thinking along the "Filter with both options" lines, random  
> (optional support) and iterator?

Right: I'm guessing we eventually need a simplistic query optimizer  
that would choose how to apply a top-level AND'd filtered (or AND NOT,  
eg for deletions).  If the filter is very sparse, it's probably best  
to use iterator especially if filter is already iterator-friendly, eg  
SortedVIntList.

> At the end of a day, Filter works at API level with DocIdSet, not  
> DocIdSetIterator.... that would remove constant factor, the question  
> is this possible to add optional DocIdSet.get(int ) on current API  
> and use it for some specialized cases like this one.

Either add that random-access API, or make a RandomAccessDocIdSet,  
or... something.  Not sure yet.  Ideally BooleanQuery (which we should  
consolidate deletions & top-level filters under) should somehow ask  
the filter if it's random access, and then drive the matching  
accordingly.

> also, math for conjunctions looks much better in filters

I'll test conjunctions too.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org