lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christoph Boosz <christoph.bo...@googlemail.com>
Subject Re: faceted search performance
Date Tue, 13 Oct 2009 09:19:50 GMT
Ok, I will have a shot at the ascending docId order.

Chris

2009/10/13 Paul Elschot <paul.elschot@xs4all.nl>

> On Monday 12 October 2009 23:29:07 Christoph Boosz wrote:
> > Hi Paul,
> >
> > Thanks for your suggestion. I will test it within the next few days.
> > However, due to memory limitations, it will only work if the number of
> hits
> > is small enough, am I right?
>
> One can load a single term vector at a time, so in this case the memory
> limitation is only in the possibly large map of doc counters per term.
> For best performance try and load the term vectors in docId order,
> after the original query has completed.
>
> In any case it would be good to somehow limit the number of
> documents considered, for example by using the ones with the best
> query score.
>
> Limiting the number of terms would also be good, but that less easy.
>
> Regards,
> Paul Elschot
>
> >
> > Chris
> >
> > 2009/10/12 Paul Elschot <paul.elschot@xs4all.nl>
> >
> > > Chris,
> > >
> > > You could also store term vectors for all docs at indexing
> > > time, and add the termvectors for the matching docs into a
> > > (large) map of terms in RAM.
> > >
> > > Regards,
> > > Paul Elschot
> > >
> > >
> > > On Monday 12 October 2009 21:30:48 Christoph Boosz wrote:
> > > > Hi Jake,
> > > >
> > > > Thanks for your helpful explanation.
> > > > In fact, my initial solution was to traverse each document in the
> result
> > > > once and count the contained terms. As you mentioned, this process
> took a
> > > > lot of memory.
> > > > Trying to confine the memory usage with the facet approach, I was
> > > surprised
> > > > by the decline in performance.
> > > > Now I know it's nothing abnormal, at least.
> > > >
> > > > Chris
> > > >
> > > >
> > > > 2009/10/12 Jake Mannix <jake.mannix@gmail.com>
> > > >
> > > > > Hey Chris,
> > > > >
> > > > > On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
> > > > > christoph.boosz@googlemail.com> wrote:
> > > > >
> > > > > > Thanks for your reply.
> > > > > > Yes, it's likely that many terms occur in few documents.
> > > > > >
> > > > > > If I understand you right, I should do the following:
> > > > > > -Write a HitCollector that simply increments a counter
> > > > > > -Get the filter for the user query once: new
> CachingWrapperFilter(new
> > > > > > QueryWrapperFilter(userQuery));
> > > > > > -Create a TermQuery for each term
> > > > > > -Perform the search and read the counter of the HitCollector
> > > > > >
> > > > > > I did that, but it didn't get faster. Any ideas why?
> > > > > >
> > > > >
> > > > > This killer is the "TermQuery for each term" part - this is huge.
> You
> > > need
> > > > > to invert this process,
> > > > > and use your query as is, but while walking in the HitCollector,
on
> > > each
> > > > > doc
> > > > > which matches
> > > > > your query, increment counters for each of the terms in that
> document
> > > > > (which
> > > > > means you need
> > > > > an in-memory forward lookup for your documents, like a multivalued
> > > > > FieldCache - and if you've
> > > > > got roughly the same number of terms as documents, this cache is
> likely
> > > to
> > > > > be as large as
> > > > > your entire index - a pretty hefty RAM cost).
> > > > >
> > > > > But a good thing to keep in mind is that doing this kind of
> faceting
> > > > > (massively multivalued
> > > > > on a huge term-set) requires a lot of computation, even if you have
> all
> > > the
> > > > > proper structures
> > > > > living in memory:
> > > > >
> > > > > For each document you look at (which matches your query), you need
> to
> > > look
> > > > > at all
> > > > > of the terms in that document, and increment a counter for that
> term.
> > >  So
> > > > > however much
> > > > > time it would normally take for you to do the driving query, it can
> > > take as
> > > > > much as that
> > > > > multiplied by the average number of terms in a document in your
> index.
> > >  If
> > > > > your documents
> > > > > are big, this could be a pretty huge latency penalty.
> > > > >
> > > > >  -jake
> > > > >
> > > >
> > >
> > >
> >
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message