Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 36029 invoked from network); 4 Feb 2009 00:15:35 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 4 Feb 2009 00:15:35 -0000 Received: (qmail 44770 invoked by uid 500); 4 Feb 2009 00:15:33 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 44722 invoked by uid 500); 4 Feb 2009 00:15:33 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 44713 invoked by uid 99); 4 Feb 2009 00:15:33 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 03 Feb 2009 16:15:33 -0800 X-ASF-Spam-Status: No, hits=1.2 required=10.0 tests=SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (nike.apache.org: local policy) Received: from [74.125.46.28] (HELO yw-out-2324.google.com) (74.125.46.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 04 Feb 2009 00:15:23 +0000 Received: by yw-out-2324.google.com with SMTP id 3so741232ywj.5 for ; Tue, 03 Feb 2009 16:15:01 -0800 (PST) Received: by 10.90.26.10 with SMTP id 10mr3872718agz.95.1233706501422; Tue, 03 Feb 2009 16:15:01 -0800 (PST) Received: from ?10.17.4.4? (pool-173-48-164-75.bstnma.fios.verizon.net [173.48.164.75]) by mx.google.com with ESMTPS id 20sm10540482agb.23.2009.02.03.16.15.00 (version=TLSv1/SSLv3 cipher=RC4-MD5); Tue, 03 Feb 2009 16:15:00 -0800 (PST) Message-Id: <9F8A5EF7-5385-43EE-AE6E-182BFD57483C@mikemccandless.com> From: Michael McCandless To: java-dev@lucene.apache.org In-Reply-To: <214130.38514.qm@web27104.mail.ukl.yahoo.com> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes Content-Transfer-Encoding: 7bit Mime-Version: 1.0 (Apple Message framework v930.3) Subject: Re: [jira] Created: (LUCENE-1533) Deleted documents as a Filter or top level Query Date: Tue, 3 Feb 2009 19:14:59 -0500 References: <1586222813.1233361859583.JavaMail.jira@brutus> <39215.18156.qm@web27102.mail.ukl.yahoo.com> <498464E8.40700@gmail.com> <2B43E660-5A05-4A7F-A966-6C242DD4924D@mikemccandless.com> <387343.66168.qm@web27104.mail.ukl.yahoo.com> <82A09A1F-C049-4A76-A772-1C2FE3554C9E@mikemccandless.com> <214130.38514.qm@web27104.mail.ukl.yahoo.com> X-Mailer: Apple Mail (2.930.3) X-Virus-Checked: Checked by ClamAV on apache.org eks dev wrote: >> I agree we should warn about this in the javadocs... can you work >> up a patch? > > I'll give it a try, no promise when, changing job, moving... > >> I plan to run some tests to figure out the performance tradeoffs >> here. >> >> We switched to iterator access for a toplevel filter, as of >> LUCENE-584, but from >> LUCENE-1476 it's looking like except for fairly sparse filters, >> random access is >> much faster. > > have a look at https://issues.apache.org/jira/browse/LUCENE-1436 > this should be important for deletions case OK. > I'll just keep dumping my thinking about it, maybe something > meaningful comes out, unfortunately not enough time to think deeper > or try it now.. No problem; keep it coming! > as long as we look at single term queries, high deletion density > cases should be faster with random access (or anything else) at > TermDocs level because we will be just propagating decision higher > up instead of "killing" document at TermDocs level. Cases with more > terms, disjunctions, are getting interesting, starting to feel speed- > up proportional to the number of intersectiong documents. > > for Query (A OR B) we need to check if(deleted) condition #A + #B > times if we do it at TermDocs level, in filter case we need to do it > only > > #(A\B) + #(B\A) + #(A AND B) and this number is smaller or equal > (worst case) than #A + #B > > this is exactly the case that makes performance headaches. Right! This is why I'm going to test the full matrix (different queries X different sparseness filters). > We have two competing issues, constant time factor on skipTo() vs > get() and algorithmic enhancement due to saved checks. Balance > depends on Query and skipTo()/get() performance diff. > > Maybe thinking along the "Filter with both options" lines, random > (optional support) and iterator? Right: I'm guessing we eventually need a simplistic query optimizer that would choose how to apply a top-level AND'd filtered (or AND NOT, eg for deletions). If the filter is very sparse, it's probably best to use iterator especially if filter is already iterator-friendly, eg SortedVIntList. > At the end of a day, Filter works at API level with DocIdSet, not > DocIdSetIterator.... that would remove constant factor, the question > is this possible to add optional DocIdSet.get(int ) on current API > and use it for some specialized cases like this one. Either add that random-access API, or make a RandomAccessDocIdSet, or... something. Not sure yet. Ideally BooleanQuery (which we should consolidate deletions & top-level filters under) should somehow ask the filter if it's random access, and then drive the matching accordingly. > also, math for conjunctions looks much better in filters I'll test conjunctions too. Mike --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org