Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <AANLkTi=qfADdwNZyngp-wzp3LyYXCr6+DeKHf8R1yUhy@mail.gmail.com>
References: <AANLkTi=qfADdwNZyngp-wzp3LyYXCr6+DeKHf8R1yUhy@mail.gmail.com>
Date: Sun, 24 Oct 2010 05:23:18 -0400
Message-ID: <AANLkTimvjx2m-rN1AsypFzccFydVFqPKwfyX=GphDC-X@mail.gmail.com>
Subject: Re: Using filters to speed up queries
From: Michael McCandless <lucene@mikemccandless.com>
To: dev@lucene.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Unfortunately, Lucene's performance with filters isn't great.

This is because we now always apply filters "up high", using a
leapfrog approach, where we alternate asking the filter and then the
scorer to skip to each other's docID.
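
To make the leapfrog idea concrete, here is a minimal sketch in plain
Java. It is not Lucene's actual code: the class and method names are made
up for illustration, and the scorer and the filter are each modelled as a
sorted array of docIDs. A real implementation would presumably use skip
data rather than the linear scan in advance().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names are Lucene APIs.
class LeapfrogSketch {
    // Index of the first element >= target at or after 'from',
    // or a.length if there is none (linear scan for simplicity).
    static int advance(int[] a, int from, int target) {
        int i = from;
        while (i < a.length && a[i] < target) {
            i++;
        }
        return i;
    }

    // Alternately ask each side to skip to the other's current docID.
    static List<Integer> intersect(int[] scorerDocs, int[] filterDocs) {
        List<Integer> hits = new ArrayList<>();
        int s = 0, f = 0;
        while (s < scorerDocs.length && f < filterDocs.length) {
            int sd = scorerDocs[s];
            int fd = filterDocs[f];
            if (sd == fd) {
                hits.add(sd);   // both sides agree on this docID
                s++;
                f++;
            } else if (sd < fd) {
                s = advance(scorerDocs, s, fd);  // scorer skips to filter's doc
            } else {
                f = advance(filterDocs, f, sd);  // filter skips to scorer's doc
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] scorer = {1, 4, 7, 9, 12, 20};
        int[] filter = {4, 5, 9, 10, 20, 31};
        System.out.println(intersect(scorer, filter)); // prints [4, 9, 20]
    }
}
```

When the filter accepts a large part of the index, most skip requests
land only a doc or two ahead, so the skipping buys little.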

But if the filter accepts "enough" (~1% in my testing) of the
documents in the index, it's often better to apply the filter "down
low", as we do for deleted docs (which are really their own filter),
i.e. where we quickly eliminate docs as we enumerate them from the
postings.
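
By contrast, "down low" could look roughly like the sketch below, again
in plain Java with made-up names rather than the flex APIs. It assumes
the filter and the deleted docs have been folded into a single bit set
of accepted docIDs, which is checked once per doc as the postings are
enumerated.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; none of these names are Lucene APIs.
class DownLowFilterSketch {
    // Walk the postings once and drop any doc whose bit is not set.
    static List<Integer> enumerate(int[] postings, BitSet accepted) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : postings) {
            if (accepted.get(doc)) {   // one random-access check per doc
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] postings = {0, 2, 3, 5, 8, 9};

        BitSet accepted = new BitSet(10);
        accepted.set(0, 10);   // start with every doc accepted
        accepted.clear(3);     // doc 3 is deleted
        accepted.clear(8);     // doc 8 is rejected by the filter

        System.out.println(enumerate(postings, accepted)); // prints [0, 2, 5, 9]
    }
}
```

There is no skipping at all here, only a cheap bit test per enumerated
doc, which is why it tends to pay off once the filter accepts more than
a small fraction of the index.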

I did a blog post about this too:

  http://chbits.blogspot.com/2010/09/fast-search-filters-using-flex.html

That post shows some of the perf gains we could get by switching
filters to apply down low, though this was for a filter that randomly
accepts 50% of the index.  And this is using the flex APIs (for 4.0);
you may be able to do something similar using FilterIndexReader
pre-4.0.

Of course you shouldn't have to do such tricks --
https://issues.apache.org/jira/browse/LUCENE-1536 is open for Lucene
to do this itself when you pass a filter.

You should test, but, I suspect a MUST clause on an AND query may not
perform that much better in general for filters that accept a biggish
part of the index, since it's still using skipping, especially if your
query wasn't already a BooleanQuery.  For restrictive filters it
should be a decent gain, but those queries are already fast to begin
with.

Do you have some perf numbers to share?  What kind of queries are you
running with the filters?  Are there certain users who have a highish
percentage of the documents, with a long tail of the other users?  If
so you could consider making dedicated indices for those high doc
count users...

Also note that static index partitioning like this does not result in
the same scoring as you'd get if each user had their own index, since
the term stats (IDF) are aggregated across all users.  So for queries
with more than one term, users can see docs sorted differently, and
this is actually a known security risk in that users can glean some
details about the documents they aren't allowed to see due to the
shared term stats... there is a paper somewhere (Robert?) that delves
into it.
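
Here is a rough sketch of why the sort order can change.  It uses the
classic idf formula, 1 + ln(numDocs / (docFreq + 1)), which I believe
matches the default similarity, but treat that as an assumption.  The
3,000,000 docs and the ~6,000 docs per user (3M / 500) come from your
numbers; the docFreq values and the terms "A" and "B" are invented for
illustration.

```java
// Hypothetical illustration only; the docFreq values are made up.
class SharedIdfSketch {
    // Classic idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        int sharedDocs = 3_000_000;  // one index shared by all users
        int userDocs = 6_000;        // the same user's docs in their own index

        // Term A: common across all users, rare for this user.
        double sharedA = idf(30_000, sharedDocs);
        double privateA = idf(5, userDocs);

        // Term B: used only by this user, in 300 of their docs.
        double sharedB = idf(300, sharedDocs);
        double privateB = idf(300, userDocs);

        System.out.println("shared index:  A=" + sharedA + " B=" + sharedB);
        System.out.println("private index: A=" + privateA + " B=" + privateB);
        // In the shared index B outweighs A; in the private index A outweighs B.
        System.out.println("order flips: " + (sharedA < sharedB && privateA > privateB));
    }
}
```

So for a query on both A and B, which term dominates the score depends
on stats contributed by docs the user cannot see.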

Mike

On Sat, Oct 23, 2010 at 6:18 PM, Khash Sajadi <khash@sajadi.co.uk> wrote:
> My index contains documents for different users. Each document has the user
> id as a field on it.
> There are about 500 different users with 3 million documents.
> Currently I'm calling Search with the query (parsed from user)
> and FieldCacheTermsFilter for the user id.
> It works but the performance is not great.
> Ideally, I would like to perform the search only on the documents that are
> relevant, this should make it much faster. However, it seems Search(Query,
> Filter) runs the query first and then applies the filter.
> Is there a way to improve this? (i.e. run the query only on a subset of
> documents)
> Thanks
