lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Efficient filtering advise
Date Tue, 24 Nov 2009 00:14:03 GMT
This was a really silly idea I had <G>. If your time is being spent in the
scoring
in the first place, keeping the Filter out of the query and checking against
it later in your Collector won't change the timing because you'll have done
all the scoring anyway. But I only thought about it on the way home this
afternoon....

Erick

On Mon, Nov 23, 2009 at 12:12 PM, Erick Erickson <erickerickson@gmail.com>wrote:

> See: http://issues.apache.org/jira/browse/LUCENE-1427
>
> <http://issues.apache.org/jira/browse/LUCENE-1427>Short form: this is
> fixed, but not until 2.9.
>
> If you don't want to upgrade, you could always leave the Filter off
> your initial query and have your Collector insure that any docs
> were in the Filter......
>
> FWIW
> Erick
>
>
>
> On Mon, Nov 23, 2009 at 11:01 AM, Eran Sevi <eransevi@gmail.com> wrote:
>
>> I've taken TermsFilter from contrib which does exactly that and indeed the
>> speed was reduced to half, which starts to be reasonable for my needs.
>>
>> I've researched the regular QueryFilter and what I write here might not be
>> the complete picture:
>> I found out that most of the time is spent on scoring the documents in the
>> filter !
>> It looks like the QueryFilter (which caches a QueryWrapperFilter) is
>> actually running the query passed to it using a searcher and a regular hit
>> collector that justs discards the score and set the doc id bit in the
>> filter. The score is still being calculated (by BooleanScorer in my case)
>> which takes a lot of time.
>>
>> TermsFilter solves part of my problem since I actually need a complex
>> boolean query (there are other terms which are being AND to the long OR
>> list). Maybe I can separate the filter to several single Term filters
>> (maybe
>> using BooleanFilter also from contrib).
>>
>> The fact is that seeking on the term docs is much faster than performing
>> the
>> equivalent search using a boolean query althought one might think that
>> seeking on the term docs is what should be done by the boolean query as
>> well
>> and the results should be more or less the same when the query is on only
>> one field.
>>
>> I know scoring and constructing the search results are closely related to
>> each other so the real question is:
>> Is there any way to run a complex query filter WITHOUT actually scoring
>> the
>> hits?
>>
>> Eran
>>
>> On Mon, Nov 23, 2009 at 3:54 PM, Erick Erickson <erickerickson@gmail.com
>> >wrote:
>>
>> > Oh my goodness yes. No wonder nothing I suggested made any
>> > difference <G>. Ignore everything I've written....
>> >
>> > OK, here's something to try, and it goes back to a Filter. Rather than
>> > make this enormous bunch of ORs, try creating a Filter. Use TermDocs
>> > to run through your list of IDs assembling a Filter (the one with the
>> >  BitSet) and send *that* through to your query rather than
>> > all your clauses. Running through the terms is much faster than I
>> thought.
>> >
>> > The only issue is that TermDocs is ordered by doc IDs, so you'll
>> > have to use, say, a Map to see if the unique IDs you're after
>> > correspond to your desired IDs.... Or who knows? Just using
>> > TermDocs.seek(Term) for your list might do the trick....
>> Term.createTerm()
>> > will help a bit here....
>> >
>> > You can get a feel for this by just spinning through your list of
>> > unique IDs using TermDocs.seek(Term) to see how long assembling
>> > the filter would take. Using the Filter in a query doesn't cost
>> > much at all....
>> >
>> > Best
>> > Erick
>> >
>> >
>> > On Mon, Nov 23, 2009 at 8:12 AM, Eran Sevi <eransevi@gmail.com> wrote:
>> >
>> > > Erick,
>> > >
>> > > Maybe I didn't make myself clear enough.
>> > > I'm talking about high level filters used when searching.
>> > >
>> > > I construct a very big BooleanQuery and add 50K clauses to it (I
>> removed
>> > > the
>> > > limit on max clauses).
>> > > Each clause is a TermQuery on the same field.
>> > > I don't know the internal doc ids that I want. I only know the value
>> > stored
>> > > in the field. this value is an external key of the document and
>> > guaranteed
>> > > to be unique.
>> > >
>> > > I pass this query to QueryFilter and run a search using search
>> > > (query,filter,collector) (reminder: i'm using ver 2.4)
>> > >
>> > > If I missed some other magic construction method to achieve the same
>> goal
>> > > then it would be great.
>> > >
>> > > Thanks for you help so far,
>> > > Eran.
>> > >
>> > > On Mon, Nov 23, 2009 at 2:55 PM, Erick Erickson <
>> erickerickson@gmail.com
>> > > >wrote:
>> > >
>> > > > Now I'm really confused, which usually means I'm making some
>> > > > assumptions that aren't true. So here they are...
>> > > >
>> > > > 1> You're talking about Filters that contain BitSets, right? Not
>> some
>> > > other
>> > > >     kind of filter.
>> > > > 2> When you create your 10-50K filters, you wind up with a single
>> > filter
>> > > >     by combining them all with one of the BitSet operators, right?
>> And
>> > > >     that *single* filter is the one you send to your query...
>> > > >
>> > > > If I'm off base here, could you post a reasonable extract of your
>> > filter
>> > > > construction code, and how you use them to search? Because I don't
>> > > > think we're all talking about the same thing here.....
>> > > >
>> > > > HTH
>> > > > Erick@ThisMakesNoSenseToMe<G>...
>> > > >
>> > > > On Mon, Nov 23, 2009 at 5:18 AM, Eran Sevi <eransevi@gmail.com>
>> wrote:
>> > > >
>> > > > > After commenting out the collector logic, the time is still more
>> or
>> > > less
>> > > > > the
>> > > > > same.
>> > > > > Anyway, since without the filter collecting the documents is
very
>> > fast
>> > > > it's
>> > > > > probably something with the filter itself.
>> > > > >
>> > > > > I don't know how the filter (or boolean query) work internally
but
>> > > > probably
>> > > > > for 10K or 50K clauses, it does something that takes a lot of
>> time.
>> > It
>> > > > > might
>> > > > > be because of the inner data structures that are used or maybe
>> just
>> > the
>> > > > > iteration on so many terms takes time.
>> > > > >
>> > > > > I'll continue to try and pinpoint the exact bottleneck postion,
or
>> > > maybe
>> > > > > using the new filters in 2.9.1 might help.
>> > > > >
>> > > > > On Sun, Nov 22, 2009 at 8:36 PM, Erick Erickson <
>> > > erickerickson@gmail.com
>> > > > > >wrote:
>> > > > >
>> > > > > > Hmmm, could you show us what you do in your collector? Because
>> > > > > > one of the gotchas about a collector is loading the documents
in
>> > > > > > the inner loop. Quick test: comment out whatever you're
doing in
>> > > > > > the underlying collector loop, and see if there's *any*
>> noticeable
>> > > > > > difference in speed. That'll tell you whether your problems
>> > > > > > arise from the filter construction/search or what you're
doing
>> > > > > > in the collector....
>> > > > > >
>> > > > > > Best
>> > > > > > Erick
>> > > > > >
>> > > > > > On Sun, Nov 22, 2009 at 11:41 AM, Eran Sevi <eransevi@gmail.com
>> >
>> > > > wrote:
>> > > > > >
>> > > > > > > I think it shouldn't take X5 times longer since the
number of
>> > > results
>> > > > > is
>> > > > > > > only about X2 times larger (and much smaller than the
number
>> of
>> > > terms
>> > > > > in
>> > > > > > > the
>> > > > > > > filter), but maybe I'm wrong here since I'm not familiar
with
>> the
>> > > > > filter
>> > > > > > > internals.
>> > > > > > >
>> > > > > > > Unfortunately, the time to construct the filter is
mere
>> > > milliseconds.
>> > > > > > > almost all of the time (~5secs) are spent in the search
>> method.
>> > > > > > > I'm using a collector to retrieve all the results (and
fetch a
>> > > value
>> > > > > for
>> > > > > > > some fields) but without the filter this also takes
less then
>> a
>> > > > second
>> > > > > > for
>> > > > > > > the same number of results.
>> > > > > > >
>> > > > > > > On Sun, Nov 22, 2009 at 5:57 PM, Erick Erickson <
>> > > > > erickerickson@gmail.com
>> > > > > > > >wrote:
>> > > > > > >
>> > > > > > > > Hmmm, I'm not very clear here. Are you saying
that you
>> > > effectively
>> > > > > > > > form 10-50K filters and OR them all together?
That would be
>> > > > > > > > consistent with the 50K case taking approx. 5X
a long as the
>> > 10K
>> > > > > > > > case.....
>> > > > > > > >
>> > > > > > > > Do you know where in your code the time is being
spent?
>> That'd
>> > > > > > > > be a big help in suggesting alternatives. If I'm
on the
>> right
>> > > > track,
>> > > > > > > > I'd expect the time to be spent assembling the
filters.....
>> > > > > > > >
>> > > > > > > > Not much help here, but I'm having trouble wrapping
my head
>> > > > > > > > around this...
>> > > > > > > >
>> > > > > > > > Best
>> > > > > > > > Erick
>> > > > > > > >
>> > > > > > > > On Sun, Nov 22, 2009 at 9:48 AM, Eran Sevi <
>> eransevi@gmail.com
>> > >
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi,
>> > > > > > > > >
>> > > > > > > > > I have a need to filter my queries using
a rather large
>> > subset
>> > > of
>> > > > > > terms
>> > > > > > > > > (can
>> > > > > > > > > be 10K or even 50K).
>> > > > > > > > > All these terms are sure to exist in the
index so the
>> number
>> > of
>> > > > > > results
>> > > > > > > > can
>> > > > > > > > > be about the same number of terms in the
filter.
>> > > > > > > > > The terms are numbers but are not subsequent
and are from
>> a
>> > > large
>> > > > > set
>> > > > > > > of
>> > > > > > > > > possible values (so range queries are probably
not good
>> for
>> > > me).
>> > > > > > > > > The index itself is about 1M docs and running
even a
>> simple
>> > > query
>> > > > > > with
>> > > > > > > > such
>> > > > > > > > > a large filter takes a lot of time even if
the number of
>> > > results
>> > > > is
>> > > > > > > only
>> > > > > > > > a
>> > > > > > > > > few hundred docs.
>> > > > > > > > > It seems like the speed is affected by the
length of the
>> > filter
>> > > > > even
>> > > > > > if
>> > > > > > > > the
>> > > > > > > > > number of results remains more or less the
same, which is
>> > > logical
>> > > > > but
>> > > > > > > not
>> > > > > > > > > by
>> > > > > > > > > such a large loss of performance as I'm experiencing
>> (running
>> > > the
>> > > > > > query
>> > > > > > > > > with
>> > > > > > > > > a 10K terms filter takes an average of 1s
187ms with 600
>> > > results
>> > > > > > while
>> > > > > > > > > running it with a 50K terms filter takes
an average of 5s
>> > 207ms
>> > > > > with
>> > > > > > > 1000
>> > > > > > > > > results).
>> > > > > > > > >
>> > > > > > > > > Currently I'm using a QueryFilter with a
boolean query in
>> > which
>> > > I
>> > > > > > "OR"
>> > > > > > > > the
>> > > > > > > > > different terms together.
>> > > > > > > > > I also can't use a cached filter efficiently
since the
>> terms
>> > to
>> > > > > > filter
>> > > > > > > on
>> > > > > > > > > change almost every query.
>> > > > > > > > >
>> > > > > > > > > I was wondering if there's a better way to
filter my
>> queries
>> > so
>> > > > > they
>> > > > > > > > won't
>> > > > > > > > > take a few seconds to run?
>> > > > > > > > >
>> > > > > > > > > Thanks in advance for any advise,
>> > > > > > > > > Eran.
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message