Return-Path: Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: (qmail 11748 invoked from network); 23 Nov 2009 16:02:09 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 23 Nov 2009 16:02:09 -0000 Received: (qmail 26403 invoked by uid 500); 23 Nov 2009 16:02:07 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 26312 invoked by uid 500); 23 Nov 2009 16:02:06 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 26298 invoked by uid 99); 23 Nov 2009 16:02:06 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 23 Nov 2009 16:02:06 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of eransevi@gmail.com designates 74.125.78.27 as permitted sender) Received: from [74.125.78.27] (HELO ey-out-2122.google.com) (74.125.78.27) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 23 Nov 2009 16:01:56 +0000 Received: by ey-out-2122.google.com with SMTP id 22so563908eye.3 for ; Mon, 23 Nov 2009 08:01:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type; bh=0FcN08cDA0361OeWJpQg/CMiBTRfN6E6iR2BzdhIJ2k=; b=jtLI+I0CAl6A+3bUz/HIiD0q5tV5xkLJprl69HfW1Tq8JSC/O64hVj5Y+J2nfF+B9X 7ASSHb8I30n0O6ug6FbugmPfwsZcokeDtM1fndZd9rWVMb/sweZQXWJPMSQyzsITV0KA 4Xu3rNwbNIrPJCRzFsiC9FzqyHOljokYl6evM= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=kcqzUbhchjqjN0oQTzRVObZVUa5bCVwc5uC9d/SOYnAL61RbjxSHcH2lINk4BqqkNL qCfIm4Kts3KfPcL8WHAYi36gfaNd/bFz3wz1vNMfJRdSrH4kekZCdMwI4KIb4+RnGZLw qMNPx/6eceq1mjolCazBc3Cq7eLuQc035Md/0= MIME-Version: 1.0 Received: by 10.213.63.75 with SMTP id a11mr1472420ebi.63.1258992096645; Mon, 23 Nov 2009 08:01:36 -0800 (PST) In-Reply-To: <359a92830911230554j29e90803q2c50389968d0341c@mail.gmail.com> References: <74f928500911220648s52467536r8233bb087440a54@mail.gmail.com> <359a92830911220757q3f48d3e6mf02e9b79a4d9d419@mail.gmail.com> <74f928500911220841x4c486c43i17a04929e3581e8a@mail.gmail.com> <359a92830911221036s571f6df7t8e9d803026eee12d@mail.gmail.com> <74f928500911230218p2770d754r9fd01e9e236beef6@mail.gmail.com> <359a92830911230455l7d6957cdp2309256805b753d0@mail.gmail.com> <74f928500911230512j5c72d77che06aeed839c2e0f6@mail.gmail.com> <359a92830911230554j29e90803q2c50389968d0341c@mail.gmail.com> Date: Mon, 23 Nov 2009 18:01:36 +0200 Message-ID: <74f928500911230801x4ade2dcbufd4eec88c37cc69f@mail.gmail.com> Subject: Re: Efficient filtering advise From: Eran Sevi To: java-user@lucene.apache.org Content-Type: multipart/alternative; boundary=00c09f79ebd5b2161d04790bf0de X-Virus-Checked: Checked by ClamAV on apache.org --00c09f79ebd5b2161d04790bf0de Content-Type: text/plain; charset=ISO-8859-1 I've taken TermsFilter from contrib which does exactly that and indeed the speed was reduced to half, which starts to be reasonable for my needs. I've researched the regular QueryFilter and what I write here might not be the complete picture: I found out that most of the time is spent on scoring the documents in the filter ! It looks like the QueryFilter (which caches a QueryWrapperFilter) is actually running the query passed to it using a searcher and a regular hit collector that justs discards the score and set the doc id bit in the filter. The score is still being calculated (by BooleanScorer in my case) which takes a lot of time. TermsFilter solves part of my problem since I actually need a complex boolean query (there are other terms which are being AND to the long OR list). Maybe I can separate the filter to several single Term filters (maybe using BooleanFilter also from contrib). The fact is that seeking on the term docs is much faster than performing the equivalent search using a boolean query althought one might think that seeking on the term docs is what should be done by the boolean query as well and the results should be more or less the same when the query is on only one field. I know scoring and constructing the search results are closely related to each other so the real question is: Is there any way to run a complex query filter WITHOUT actually scoring the hits? Eran On Mon, Nov 23, 2009 at 3:54 PM, Erick Erickson wrote: > Oh my goodness yes. No wonder nothing I suggested made any > difference . Ignore everything I've written.... > > OK, here's something to try, and it goes back to a Filter. Rather than > make this enormous bunch of ORs, try creating a Filter. Use TermDocs > to run through your list of IDs assembling a Filter (the one with the > BitSet) and send *that* through to your query rather than > all your clauses. Running through the terms is much faster than I thought. > > The only issue is that TermDocs is ordered by doc IDs, so you'll > have to use, say, a Map to see if the unique IDs you're after > correspond to your desired IDs.... Or who knows? Just using > TermDocs.seek(Term) for your list might do the trick.... Term.createTerm() > will help a bit here.... > > You can get a feel for this by just spinning through your list of > unique IDs using TermDocs.seek(Term) to see how long assembling > the filter would take. Using the Filter in a query doesn't cost > much at all.... > > Best > Erick > > > On Mon, Nov 23, 2009 at 8:12 AM, Eran Sevi wrote: > > > Erick, > > > > Maybe I didn't make myself clear enough. > > I'm talking about high level filters used when searching. > > > > I construct a very big BooleanQuery and add 50K clauses to it (I removed > > the > > limit on max clauses). > > Each clause is a TermQuery on the same field. > > I don't know the internal doc ids that I want. I only know the value > stored > > in the field. this value is an external key of the document and > guaranteed > > to be unique. > > > > I pass this query to QueryFilter and run a search using search > > (query,filter,collector) (reminder: i'm using ver 2.4) > > > > If I missed some other magic construction method to achieve the same goal > > then it would be great. > > > > Thanks for you help so far, > > Eran. > > > > On Mon, Nov 23, 2009 at 2:55 PM, Erick Erickson > >wrote: > > > > > Now I'm really confused, which usually means I'm making some > > > assumptions that aren't true. So here they are... > > > > > > 1> You're talking about Filters that contain BitSets, right? Not some > > other > > > kind of filter. > > > 2> When you create your 10-50K filters, you wind up with a single > filter > > > by combining them all with one of the BitSet operators, right? And > > > that *single* filter is the one you send to your query... > > > > > > If I'm off base here, could you post a reasonable extract of your > filter > > > construction code, and how you use them to search? Because I don't > > > think we're all talking about the same thing here..... > > > > > > HTH > > > Erick@ThisMakesNoSenseToMe... > > > > > > On Mon, Nov 23, 2009 at 5:18 AM, Eran Sevi wrote: > > > > > > > After commenting out the collector logic, the time is still more or > > less > > > > the > > > > same. > > > > Anyway, since without the filter collecting the documents is very > fast > > > it's > > > > probably something with the filter itself. > > > > > > > > I don't know how the filter (or boolean query) work internally but > > > probably > > > > for 10K or 50K clauses, it does something that takes a lot of time. > It > > > > might > > > > be because of the inner data structures that are used or maybe just > the > > > > iteration on so many terms takes time. > > > > > > > > I'll continue to try and pinpoint the exact bottleneck postion, or > > maybe > > > > using the new filters in 2.9.1 might help. > > > > > > > > On Sun, Nov 22, 2009 at 8:36 PM, Erick Erickson < > > erickerickson@gmail.com > > > > >wrote: > > > > > > > > > Hmmm, could you show us what you do in your collector? Because > > > > > one of the gotchas about a collector is loading the documents in > > > > > the inner loop. Quick test: comment out whatever you're doing in > > > > > the underlying collector loop, and see if there's *any* noticeable > > > > > difference in speed. That'll tell you whether your problems > > > > > arise from the filter construction/search or what you're doing > > > > > in the collector.... > > > > > > > > > > Best > > > > > Erick > > > > > > > > > > On Sun, Nov 22, 2009 at 11:41 AM, Eran Sevi > > > wrote: > > > > > > > > > > > I think it shouldn't take X5 times longer since the number of > > results > > > > is > > > > > > only about X2 times larger (and much smaller than the number of > > terms > > > > in > > > > > > the > > > > > > filter), but maybe I'm wrong here since I'm not familiar with the > > > > filter > > > > > > internals. > > > > > > > > > > > > Unfortunately, the time to construct the filter is mere > > milliseconds. > > > > > > almost all of the time (~5secs) are spent in the search method. > > > > > > I'm using a collector to retrieve all the results (and fetch a > > value > > > > for > > > > > > some fields) but without the filter this also takes less then a > > > second > > > > > for > > > > > > the same number of results. > > > > > > > > > > > > On Sun, Nov 22, 2009 at 5:57 PM, Erick Erickson < > > > > erickerickson@gmail.com > > > > > > >wrote: > > > > > > > > > > > > > Hmmm, I'm not very clear here. Are you saying that you > > effectively > > > > > > > form 10-50K filters and OR them all together? That would be > > > > > > > consistent with the 50K case taking approx. 5X a long as the > 10K > > > > > > > case..... > > > > > > > > > > > > > > Do you know where in your code the time is being spent? That'd > > > > > > > be a big help in suggesting alternatives. If I'm on the right > > > track, > > > > > > > I'd expect the time to be spent assembling the filters..... > > > > > > > > > > > > > > Not much help here, but I'm having trouble wrapping my head > > > > > > > around this... > > > > > > > > > > > > > > Best > > > > > > > Erick > > > > > > > > > > > > > > On Sun, Nov 22, 2009 at 9:48 AM, Eran Sevi > > > > > wrote: > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > I have a need to filter my queries using a rather large > subset > > of > > > > > terms > > > > > > > > (can > > > > > > > > be 10K or even 50K). > > > > > > > > All these terms are sure to exist in the index so the number > of > > > > > results > > > > > > > can > > > > > > > > be about the same number of terms in the filter. > > > > > > > > The terms are numbers but are not subsequent and are from a > > large > > > > set > > > > > > of > > > > > > > > possible values (so range queries are probably not good for > > me). > > > > > > > > The index itself is about 1M docs and running even a simple > > query > > > > > with > > > > > > > such > > > > > > > > a large filter takes a lot of time even if the number of > > results > > > is > > > > > > only > > > > > > > a > > > > > > > > few hundred docs. > > > > > > > > It seems like the speed is affected by the length of the > filter > > > > even > > > > > if > > > > > > > the > > > > > > > > number of results remains more or less the same, which is > > logical > > > > but > > > > > > not > > > > > > > > by > > > > > > > > such a large loss of performance as I'm experiencing (running > > the > > > > > query > > > > > > > > with > > > > > > > > a 10K terms filter takes an average of 1s 187ms with 600 > > results > > > > > while > > > > > > > > running it with a 50K terms filter takes an average of 5s > 207ms > > > > with > > > > > > 1000 > > > > > > > > results). > > > > > > > > > > > > > > > > Currently I'm using a QueryFilter with a boolean query in > which > > I > > > > > "OR" > > > > > > > the > > > > > > > > different terms together. > > > > > > > > I also can't use a cached filter efficiently since the terms > to > > > > > filter > > > > > > on > > > > > > > > change almost every query. > > > > > > > > > > > > > > > > I was wondering if there's a better way to filter my queries > so > > > > they > > > > > > > won't > > > > > > > > take a few seconds to run? > > > > > > > > > > > > > > > > Thanks in advance for any advise, > > > > > > > > Eran. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --00c09f79ebd5b2161d04790bf0de--