Date: Sun, 24 Oct 2010 05:23:18 -0400
Subject: Re: Using filters to speed up queries
From: Michael McCandless
To: dev@lucene.apache.org

Unfortunately, Lucene's performance with filters isn't great. This is because we now always apply filters "up high", using a leapfrog approach, where we alternate asking the filter and then the scorer to skip to each other's docID.
But if the filter accepts "enough" (~1% in my testing) of the documents in the index, it's often better to apply the filter "down low", like we do for deleted docs (which really are their own filter), i.e. where we quickly eliminate docs as we enumerate them from the postings.

I did a blog post about this too:

  http://chbits.blogspot.com/2010/09/fast-search-filters-using-flex.html

That post shows some of the perf gains we could get by switching filters to apply down low, though this was for a filter that randomly accepts 50% of the index. And this is using the flex APIs (for 4.0); you may be able to do something similar using FilterIndexReader pre-4.0.

Of course you shouldn't have to do such tricks -- https://issues.apache.org/jira/browse/LUCENE-1536 is open for Lucene to do this itself when you pass a filter.

You should test, but I suspect a MUST clause on an AND query may not perform that much better in general for filters that accept a biggish part of the index, since it's still using skipping, especially if your query wasn't already a BooleanQuery. For restrictive filters it should be a decent gain, but those queries are already fast to begin with.

Do you have some perf numbers to share? What kind of queries are you running with the filters? Are there certain users that have a highish percentage of the documents, with a long tail of the other users? If so you could consider making dedicated indices for those high doc count users...

Also note that static index partitioning like this does not result in the same scoring as you'd get if each user had their own index, since the term stats (IDF) are aggregated across all users. So for queries with more than one term, users can see docs sorted differently, and this is actually a known security risk in that users can glean some details about the documents they aren't allowed to see due to the shared term stats... there is a paper somewhere (Robert?) that delves into it.
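
For readers who want the "up high" versus "down low" distinction spelled out, here is a minimal sketch in plain Java. It does not use Lucene's APIs: the class FilterStrategies, its methods, and the sample docIDs are all hypothetical, sorted int arrays stand in for postings, and a linear scan stands in for the skip-list based skipping a real scorer or filter would do. It is only meant to show the shape of the two loops, not how Lucene implements either one.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class FilterStrategies {

    /** Returns the smallest index i >= from with docs[i] >= target, or docs.length if none.
     *  A linear scan here; a real implementation would use skip data. */
    public static int advance(int[] docs, int from, int target) {
        int i = from;
        while (i < docs.length && docs[i] < target) {
            i++;
        }
        return i;
    }

    /** "Up high": scorer and filter take turns skipping to each other's current docID. */
    public static List<Integer> leapfrog(int[] scorerDocs, int[] filterDocs) {
        List<Integer> hits = new ArrayList<>();
        int s = 0;
        int f = 0;
        while (s < scorerDocs.length && f < filterDocs.length) {
            if (scorerDocs[s] == filterDocs[f]) {
                hits.add(scorerDocs[s]);   // both sides agree on this doc
                s++;
                f++;
            } else if (scorerDocs[s] < filterDocs[f]) {
                s = advance(scorerDocs, s, filterDocs[f]);  // scorer catches up to filter
            } else {
                f = advance(filterDocs, f, scorerDocs[s]);  // filter catches up to scorer
            }
        }
        return hits;
    }

    /** "Down low": check a random-access bit set while enumerating the postings,
     *  the same way deleted docs are skipped. */
    public static List<Integer> downLow(int[] scorerDocs, BitSet accepted) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : scorerDocs) {
            if (accepted.get(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] scorerDocs = {1, 4, 5, 9, 12, 20};   // made-up postings for the query
        int[] filterDocs = {2, 4, 9, 10, 20, 31};  // made-up docs the filter accepts
        BitSet accepted = new BitSet();
        for (int d : filterDocs) {
            accepted.set(d);
        }
        System.out.println(leapfrog(scorerDocs, filterDocs));  // prints [4, 9, 20]
        System.out.println(downLow(scorerDocs, accepted));     // prints [4, 9, 20]
    }
}
```

Both methods return the same hits. The difference is in the work per document: the leapfrog loop pays for a skip on one side or the other at nearly every step when the filter is dense, while the bit set check is a single random-access lookup per enumerated doc.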
Mike

On Sat, Oct 23, 2010 at 6:18 PM, Khash Sajadi wrote:
> My index contains documents for different users. Each document has the user
> id as a field on it.
> There are about 500 different users with 3 million documents.
> Currently I'm calling Search with the query (parsed from user)
> and FieldCacheTermsFilter for the user id.
> It works but the performance is not great.
> Ideally, I would like to perform the search only on the documents that are
> relevant, this should make it much faster. However, it seems Search(Query,
> Filter) runs the query first and then applies the filter.
> Is there a way to improve this? (i.e. run the query only on a subset of
> documents)
> Thanks