lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Efficient filtering advise
Date Mon, 23 Nov 2009 13:54:35 GMT
Oh my goodness yes. No wonder nothing I suggested made any
difference <G>. Ignore everything I've written....

OK, here's something to try, and it goes back to a Filter. Rather than
make this enormous bunch of ORs, try creating a Filter. Use TermDocs
to run through your list of IDs assembling a Filter (the one with the
 BitSet) and send *that* through to your query rather than
all your clauses. Running through the terms is much faster than I thought.

The only issue is that TermDocs is ordered by doc IDs, so you'll
have to use, say, a Map to see if the unique IDs you're after
correspond to your desired IDs.... Or who knows? Just using
TermDocs.seek(Term) for your list might do the trick.... Term.createTerm()
will help a bit here....

You can get a feel for this by just spinning through your list of
unique IDs using TermDocs.seek(Term) to see how long assembling
the filter would take. Using the Filter in a query doesn't cost
much at all....

Best
Erick


On Mon, Nov 23, 2009 at 8:12 AM, Eran Sevi <eransevi@gmail.com> wrote:

> Erick,
>
> Maybe I didn't make myself clear enough.
> I'm talking about high level filters used when searching.
>
> I construct a very big BooleanQuery and add 50K clauses to it (I removed
> the
> limit on max clauses).
> Each clause is a TermQuery on the same field.
> I don't know the internal doc ids that I want. I only know the value stored
> in the field. this value is an external key of the document and guaranteed
> to be unique.
>
> I pass this query to QueryFilter and run a search using search
> (query,filter,collector) (reminder: i'm using ver 2.4)
>
> If I missed some other magic construction method to achieve the same goal
> then it would be great.
>
> Thanks for you help so far,
> Eran.
>
> On Mon, Nov 23, 2009 at 2:55 PM, Erick Erickson <erickerickson@gmail.com
> >wrote:
>
> > Now I'm really confused, which usually means I'm making some
> > assumptions that aren't true. So here they are...
> >
> > 1> You're talking about Filters that contain BitSets, right? Not some
> other
> >     kind of filter.
> > 2> When you create your 10-50K filters, you wind up with a single filter
> >     by combining them all with one of the BitSet operators, right? And
> >     that *single* filter is the one you send to your query...
> >
> > If I'm off base here, could you post a reasonable extract of your filter
> > construction code, and how you use them to search? Because I don't
> > think we're all talking about the same thing here.....
> >
> > HTH
> > Erick@ThisMakesNoSenseToMe<G>...
> >
> > On Mon, Nov 23, 2009 at 5:18 AM, Eran Sevi <eransevi@gmail.com> wrote:
> >
> > > After commenting out the collector logic, the time is still more or
> less
> > > the
> > > same.
> > > Anyway, since without the filter collecting the documents is very fast
> > it's
> > > probably something with the filter itself.
> > >
> > > I don't know how the filter (or boolean query) work internally but
> > probably
> > > for 10K or 50K clauses, it does something that takes a lot of time. It
> > > might
> > > be because of the inner data structures that are used or maybe just the
> > > iteration on so many terms takes time.
> > >
> > > I'll continue to try and pinpoint the exact bottleneck postion, or
> maybe
> > > using the new filters in 2.9.1 might help.
> > >
> > > On Sun, Nov 22, 2009 at 8:36 PM, Erick Erickson <
> erickerickson@gmail.com
> > > >wrote:
> > >
> > > > Hmmm, could you show us what you do in your collector? Because
> > > > one of the gotchas about a collector is loading the documents in
> > > > the inner loop. Quick test: comment out whatever you're doing in
> > > > the underlying collector loop, and see if there's *any* noticeable
> > > > difference in speed. That'll tell you whether your problems
> > > > arise from the filter construction/search or what you're doing
> > > > in the collector....
> > > >
> > > > Best
> > > > Erick
> > > >
> > > > On Sun, Nov 22, 2009 at 11:41 AM, Eran Sevi <eransevi@gmail.com>
> > wrote:
> > > >
> > > > > I think it shouldn't take X5 times longer since the number of
> results
> > > is
> > > > > only about X2 times larger (and much smaller than the number of
> terms
> > > in
> > > > > the
> > > > > filter), but maybe I'm wrong here since I'm not familiar with the
> > > filter
> > > > > internals.
> > > > >
> > > > > Unfortunately, the time to construct the filter is mere
> milliseconds.
> > > > > almost all of the time (~5secs) are spent in the search method.
> > > > > I'm using a collector to retrieve all the results (and fetch a
> value
> > > for
> > > > > some fields) but without the filter this also takes less then a
> > second
> > > > for
> > > > > the same number of results.
> > > > >
> > > > > On Sun, Nov 22, 2009 at 5:57 PM, Erick Erickson <
> > > erickerickson@gmail.com
> > > > > >wrote:
> > > > >
> > > > > > Hmmm, I'm not very clear here. Are you saying that you
> effectively
> > > > > > form 10-50K filters and OR them all together? That would be
> > > > > > consistent with the 50K case taking approx. 5X a long as the
10K
> > > > > > case.....
> > > > > >
> > > > > > Do you know where in your code the time is being spent? That'd
> > > > > > be a big help in suggesting alternatives. If I'm on the right
> > track,
> > > > > > I'd expect the time to be spent assembling the filters.....
> > > > > >
> > > > > > Not much help here, but I'm having trouble wrapping my head
> > > > > > around this...
> > > > > >
> > > > > > Best
> > > > > > Erick
> > > > > >
> > > > > > On Sun, Nov 22, 2009 at 9:48 AM, Eran Sevi <eransevi@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I have a need to filter my queries using a rather large
subset
> of
> > > > terms
> > > > > > > (can
> > > > > > > be 10K or even 50K).
> > > > > > > All these terms are sure to exist in the index so the number
of
> > > > results
> > > > > > can
> > > > > > > be about the same number of terms in the filter.
> > > > > > > The terms are numbers but are not subsequent and are from
a
> large
> > > set
> > > > > of
> > > > > > > possible values (so range queries are probably not good
for
> me).
> > > > > > > The index itself is about 1M docs and running even a simple
> query
> > > > with
> > > > > > such
> > > > > > > a large filter takes a lot of time even if the number of
> results
> > is
> > > > > only
> > > > > > a
> > > > > > > few hundred docs.
> > > > > > > It seems like the speed is affected by the length of the
filter
> > > even
> > > > if
> > > > > > the
> > > > > > > number of results remains more or less the same, which
is
> logical
> > > but
> > > > > not
> > > > > > > by
> > > > > > > such a large loss of performance as I'm experiencing (running
> the
> > > > query
> > > > > > > with
> > > > > > > a 10K terms filter takes an average of 1s 187ms with 600
> results
> > > > while
> > > > > > > running it with a 50K terms filter takes an average of
5s 207ms
> > > with
> > > > > 1000
> > > > > > > results).
> > > > > > >
> > > > > > > Currently I'm using a QueryFilter with a boolean query
in which
> I
> > > > "OR"
> > > > > > the
> > > > > > > different terms together.
> > > > > > > I also can't use a cached filter efficiently since the
terms to
> > > > filter
> > > > > on
> > > > > > > change almost every query.
> > > > > > >
> > > > > > > I was wondering if there's a better way to filter my queries
so
> > > they
> > > > > > won't
> > > > > > > take a few seconds to run?
> > > > > > >
> > > > > > > Thanks in advance for any advise,
> > > > > > > Eran.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message