lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: Efficient filtering advise
Date Sun, 22 Nov 2009 18:36:38 GMT
Hmmm, could you show us what you do in your collector? Because
one of the gotchas about a collector is loading the documents in
the inner loop. Quick test: comment out whatever you're doing in
the underlying collector loop, and see if there's *any* noticeable
difference in speed. That'll tell you whether your problems
arise from the filter construction/search or what you're doing
in the collector....

Best
Erick

On Sun, Nov 22, 2009 at 11:41 AM, Eran Sevi <eransevi@gmail.com> wrote:

> I think it shouldn't take X5 times longer since the number of results is
> only about X2 times larger (and much smaller than the number of terms in
> the
> filter), but maybe I'm wrong here since I'm not familiar with the filter
> internals.
>
> Unfortunately, the time to construct the filter is mere milliseconds.
> almost all of the time (~5secs) are spent in the search method.
> I'm using a collector to retrieve all the results (and fetch a value for
> some fields) but without the filter this also takes less then a second for
> the same number of results.
>
> On Sun, Nov 22, 2009 at 5:57 PM, Erick Erickson <erickerickson@gmail.com
> >wrote:
>
> > Hmmm, I'm not very clear here. Are you saying that you effectively
> > form 10-50K filters and OR them all together? That would be
> > consistent with the 50K case taking approx. 5X a long as the 10K
> > case.....
> >
> > Do you know where in your code the time is being spent? That'd
> > be a big help in suggesting alternatives. If I'm on the right track,
> > I'd expect the time to be spent assembling the filters.....
> >
> > Not much help here, but I'm having trouble wrapping my head
> > around this...
> >
> > Best
> > Erick
> >
> > On Sun, Nov 22, 2009 at 9:48 AM, Eran Sevi <eransevi@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I have a need to filter my queries using a rather large subset of terms
> > > (can
> > > be 10K or even 50K).
> > > All these terms are sure to exist in the index so the number of results
> > can
> > > be about the same number of terms in the filter.
> > > The terms are numbers but are not subsequent and are from a large set
> of
> > > possible values (so range queries are probably not good for me).
> > > The index itself is about 1M docs and running even a simple query with
> > such
> > > a large filter takes a lot of time even if the number of results is
> only
> > a
> > > few hundred docs.
> > > It seems like the speed is affected by the length of the filter even if
> > the
> > > number of results remains more or less the same, which is logical but
> not
> > > by
> > > such a large loss of performance as I'm experiencing (running the query
> > > with
> > > a 10K terms filter takes an average of 1s 187ms with 600 results while
> > > running it with a 50K terms filter takes an average of 5s 207ms with
> 1000
> > > results).
> > >
> > > Currently I'm using a QueryFilter with a boolean query in which I "OR"
> > the
> > > different terms together.
> > > I also can't use a cached filter efficiently since the terms to filter
> on
> > > change almost every query.
> > >
> > > I was wondering if there's a better way to filter my queries so they
> > won't
> > > take a few seconds to run?
> > >
> > > Thanks in advance for any advise,
> > > Eran.
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message