lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Efficient filtering advise
Date Mon, 23 Nov 2009 17:12:29 GMT
See: http://issues.apache.org/jira/browse/LUCENE-1427

<http://issues.apache.org/jira/browse/LUCENE-1427>Short form: this is fixed,
but not until 2.9.

If you don't want to upgrade, you could always leave the Filter off
your initial query and have your Collector insure that any docs
were in the Filter......

FWIW
Erick



On Mon, Nov 23, 2009 at 11:01 AM, Eran Sevi <eransevi@gmail.com> wrote:

> I've taken TermsFilter from contrib which does exactly that and indeed the
> speed was reduced to half, which starts to be reasonable for my needs.
>
> I've researched the regular QueryFilter and what I write here might not be
> the complete picture:
> I found out that most of the time is spent on scoring the documents in the
> filter !
> It looks like the QueryFilter (which caches a QueryWrapperFilter) is
> actually running the query passed to it using a searcher and a regular hit
> collector that justs discards the score and set the doc id bit in the
> filter. The score is still being calculated (by BooleanScorer in my case)
> which takes a lot of time.
>
> TermsFilter solves part of my problem since I actually need a complex
> boolean query (there are other terms which are being AND to the long OR
> list). Maybe I can separate the filter to several single Term filters
> (maybe
> using BooleanFilter also from contrib).
>
> The fact is that seeking on the term docs is much faster than performing
> the
> equivalent search using a boolean query althought one might think that
> seeking on the term docs is what should be done by the boolean query as
> well
> and the results should be more or less the same when the query is on only
> one field.
>
> I know scoring and constructing the search results are closely related to
> each other so the real question is:
> Is there any way to run a complex query filter WITHOUT actually scoring the
> hits?
>
> Eran
>
> On Mon, Nov 23, 2009 at 3:54 PM, Erick Erickson <erickerickson@gmail.com
> >wrote:
>
> > Oh my goodness yes. No wonder nothing I suggested made any
> > difference <G>. Ignore everything I've written....
> >
> > OK, here's something to try, and it goes back to a Filter. Rather than
> > make this enormous bunch of ORs, try creating a Filter. Use TermDocs
> > to run through your list of IDs assembling a Filter (the one with the
> >  BitSet) and send *that* through to your query rather than
> > all your clauses. Running through the terms is much faster than I
> thought.
> >
> > The only issue is that TermDocs is ordered by doc IDs, so you'll
> > have to use, say, a Map to see if the unique IDs you're after
> > correspond to your desired IDs.... Or who knows? Just using
> > TermDocs.seek(Term) for your list might do the trick....
> Term.createTerm()
> > will help a bit here....
> >
> > You can get a feel for this by just spinning through your list of
> > unique IDs using TermDocs.seek(Term) to see how long assembling
> > the filter would take. Using the Filter in a query doesn't cost
> > much at all....
> >
> > Best
> > Erick
> >
> >
> > On Mon, Nov 23, 2009 at 8:12 AM, Eran Sevi <eransevi@gmail.com> wrote:
> >
> > > Erick,
> > >
> > > Maybe I didn't make myself clear enough.
> > > I'm talking about high level filters used when searching.
> > >
> > > I construct a very big BooleanQuery and add 50K clauses to it (I
> removed
> > > the
> > > limit on max clauses).
> > > Each clause is a TermQuery on the same field.
> > > I don't know the internal doc ids that I want. I only know the value
> > stored
> > > in the field. this value is an external key of the document and
> > guaranteed
> > > to be unique.
> > >
> > > I pass this query to QueryFilter and run a search using search
> > > (query,filter,collector) (reminder: i'm using ver 2.4)
> > >
> > > If I missed some other magic construction method to achieve the same
> goal
> > > then it would be great.
> > >
> > > Thanks for you help so far,
> > > Eran.
> > >
> > > On Mon, Nov 23, 2009 at 2:55 PM, Erick Erickson <
> erickerickson@gmail.com
> > > >wrote:
> > >
> > > > Now I'm really confused, which usually means I'm making some
> > > > assumptions that aren't true. So here they are...
> > > >
> > > > 1> You're talking about Filters that contain BitSets, right? Not some
> > > other
> > > >     kind of filter.
> > > > 2> When you create your 10-50K filters, you wind up with a single
> > filter
> > > >     by combining them all with one of the BitSet operators, right?
> And
> > > >     that *single* filter is the one you send to your query...
> > > >
> > > > If I'm off base here, could you post a reasonable extract of your
> > filter
> > > > construction code, and how you use them to search? Because I don't
> > > > think we're all talking about the same thing here.....
> > > >
> > > > HTH
> > > > Erick@ThisMakesNoSenseToMe<G>...
> > > >
> > > > On Mon, Nov 23, 2009 at 5:18 AM, Eran Sevi <eransevi@gmail.com>
> wrote:
> > > >
> > > > > After commenting out the collector logic, the time is still more
or
> > > less
> > > > > the
> > > > > same.
> > > > > Anyway, since without the filter collecting the documents is very
> > fast
> > > > it's
> > > > > probably something with the filter itself.
> > > > >
> > > > > I don't know how the filter (or boolean query) work internally but
> > > > probably
> > > > > for 10K or 50K clauses, it does something that takes a lot of time.
> > It
> > > > > might
> > > > > be because of the inner data structures that are used or maybe just
> > the
> > > > > iteration on so many terms takes time.
> > > > >
> > > > > I'll continue to try and pinpoint the exact bottleneck postion, or
> > > maybe
> > > > > using the new filters in 2.9.1 might help.
> > > > >
> > > > > On Sun, Nov 22, 2009 at 8:36 PM, Erick Erickson <
> > > erickerickson@gmail.com
> > > > > >wrote:
> > > > >
> > > > > > Hmmm, could you show us what you do in your collector? Because
> > > > > > one of the gotchas about a collector is loading the documents
in
> > > > > > the inner loop. Quick test: comment out whatever you're doing
in
> > > > > > the underlying collector loop, and see if there's *any*
> noticeable
> > > > > > difference in speed. That'll tell you whether your problems
> > > > > > arise from the filter construction/search or what you're doing
> > > > > > in the collector....
> > > > > >
> > > > > > Best
> > > > > > Erick
> > > > > >
> > > > > > On Sun, Nov 22, 2009 at 11:41 AM, Eran Sevi <eransevi@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > I think it shouldn't take X5 times longer since the number
of
> > > results
> > > > > is
> > > > > > > only about X2 times larger (and much smaller than the number
of
> > > terms
> > > > > in
> > > > > > > the
> > > > > > > filter), but maybe I'm wrong here since I'm not familiar
with
> the
> > > > > filter
> > > > > > > internals.
> > > > > > >
> > > > > > > Unfortunately, the time to construct the filter is mere
> > > milliseconds.
> > > > > > > almost all of the time (~5secs) are spent in the search
method.
> > > > > > > I'm using a collector to retrieve all the results (and
fetch a
> > > value
> > > > > for
> > > > > > > some fields) but without the filter this also takes less
then a
> > > > second
> > > > > > for
> > > > > > > the same number of results.
> > > > > > >
> > > > > > > On Sun, Nov 22, 2009 at 5:57 PM, Erick Erickson <
> > > > > erickerickson@gmail.com
> > > > > > > >wrote:
> > > > > > >
> > > > > > > > Hmmm, I'm not very clear here. Are you saying that
you
> > > effectively
> > > > > > > > form 10-50K filters and OR them all together? That
would be
> > > > > > > > consistent with the 50K case taking approx. 5X a long
as the
> > 10K
> > > > > > > > case.....
> > > > > > > >
> > > > > > > > Do you know where in your code the time is being spent?
> That'd
> > > > > > > > be a big help in suggesting alternatives. If I'm on
the right
> > > > track,
> > > > > > > > I'd expect the time to be spent assembling the filters.....
> > > > > > > >
> > > > > > > > Not much help here, but I'm having trouble wrapping
my head
> > > > > > > > around this...
> > > > > > > >
> > > > > > > > Best
> > > > > > > > Erick
> > > > > > > >
> > > > > > > > On Sun, Nov 22, 2009 at 9:48 AM, Eran Sevi <
> eransevi@gmail.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I have a need to filter my queries using a rather
large
> > subset
> > > of
> > > > > > terms
> > > > > > > > > (can
> > > > > > > > > be 10K or even 50K).
> > > > > > > > > All these terms are sure to exist in the index
so the
> number
> > of
> > > > > > results
> > > > > > > > can
> > > > > > > > > be about the same number of terms in the filter.
> > > > > > > > > The terms are numbers but are not subsequent
and are from a
> > > large
> > > > > set
> > > > > > > of
> > > > > > > > > possible values (so range queries are probably
not good for
> > > me).
> > > > > > > > > The index itself is about 1M docs and running
even a simple
> > > query
> > > > > > with
> > > > > > > > such
> > > > > > > > > a large filter takes a lot of time even if the
number of
> > > results
> > > > is
> > > > > > > only
> > > > > > > > a
> > > > > > > > > few hundred docs.
> > > > > > > > > It seems like the speed is affected by the length
of the
> > filter
> > > > > even
> > > > > > if
> > > > > > > > the
> > > > > > > > > number of results remains more or less the same,
which is
> > > logical
> > > > > but
> > > > > > > not
> > > > > > > > > by
> > > > > > > > > such a large loss of performance as I'm experiencing
> (running
> > > the
> > > > > > query
> > > > > > > > > with
> > > > > > > > > a 10K terms filter takes an average of 1s 187ms
with 600
> > > results
> > > > > > while
> > > > > > > > > running it with a 50K terms filter takes an average
of 5s
> > 207ms
> > > > > with
> > > > > > > 1000
> > > > > > > > > results).
> > > > > > > > >
> > > > > > > > > Currently I'm using a QueryFilter with a boolean
query in
> > which
> > > I
> > > > > > "OR"
> > > > > > > > the
> > > > > > > > > different terms together.
> > > > > > > > > I also can't use a cached filter efficiently
since the
> terms
> > to
> > > > > > filter
> > > > > > > on
> > > > > > > > > change almost every query.
> > > > > > > > >
> > > > > > > > > I was wondering if there's a better way to filter
my
> queries
> > so
> > > > > they
> > > > > > > > won't
> > > > > > > > > take a few seconds to run?
> > > > > > > > >
> > > > > > > > > Thanks in advance for any advise,
> > > > > > > > > Eran.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message