lucene-java-user mailing list archives

From Christoph Boosz <christoph.bo...@googlemail.com>
Subject Re: faceted search performance
Date Mon, 12 Oct 2009 21:29:07 GMT
Hi Paul,

Thanks for your suggestion. I will test it within the next few days.
However, due to memory limitations, it will only work if the number of hits
is small enough, am I right?

Chris

2009/10/12 Paul Elschot <paul.elschot@xs4all.nl>

> Chris,
>
> You could also store term vectors for all docs at indexing
> time, and add the termvectors for the matching docs into a
> (large) map of terms in RAM.
>
> Regards,
> Paul Elschot
>
>
> On Monday 12 October 2009 21:30:48 Christoph Boosz wrote:
> > Hi Jake,
> >
> > Thanks for your helpful explanation.
> > In fact, my initial solution was to traverse each document in the result
> > once and count the contained terms. As you mentioned, this process took a
> > lot of memory.
> > Trying to confine the memory usage with the facet approach, I was surprised
> > by the decline in performance.
> > Now I know it's nothing abnormal, at least.
> >
> > Chris
> >
> >
> > 2009/10/12 Jake Mannix <jake.mannix@gmail.com>
> >
> > > Hey Chris,
> > >
> > > On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz <
> > > christoph.boosz@googlemail.com> wrote:
> > >
> > > > Thanks for your reply.
> > > > Yes, it's likely that many terms occur in few documents.
> > > >
> > > > If I understand you right, I should do the following:
> > > > -Write a HitCollector that simply increments a counter
> > > > -Get the filter for the user query once: new CachingWrapperFilter(new
> > > > QueryWrapperFilter(userQuery));
> > > > -Create a TermQuery for each term
> > > > -Perform the search and read the counter of the HitCollector
> > > >
> > > > I did that, but it didn't get faster. Any ideas why?
> > > >
> > >
> > > The killer is the "TermQuery for each term" part - this is huge. You need
> > > to invert this process, and use your query as is, but while walking in
> > > the HitCollector, on each doc which matches your query, increment
> > > counters for each of the terms in that document (which means you need an
> > > in-memory forward lookup for your documents, like a multivalued
> > > FieldCache - and if you've got roughly the same number of terms as
> > > documents, this cache is likely to be as large as your entire index - a
> > > pretty hefty RAM cost).
> > >
> > > But a good thing to keep in mind is that doing this kind of faceting
> > > (massively multivalued on a huge term-set) requires a lot of computation,
> > > even if you have all the proper structures living in memory:
> > >
> > > For each document you look at (which matches your query), you need to
> > > look at all of the terms in that document, and increment a counter for
> > > that term. So however much time it would normally take for you to do the
> > > driving query, it can take as much as that multiplied by the average
> > > number of terms in a document in your index. If your documents are big,
> > > this could be a pretty huge latency penalty.
> > >
> > >  -jake
> > >
> >
>
>
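
For reference, the inverted counting that Jake describes (run the user query
once, then bump a counter for every term of every matching doc) might look
roughly like the sketch below. It is plain Java with no Lucene calls: the
forwardIndex map stands in for the in-memory forward lookup (a multivalued
FieldCache or stored term vectors), and matchingDocs stands in for the doc IDs
a HitCollector would receive. The class name, method name and both parameters
are hypothetical and only illustrate the loop structure.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FacetCountSketch {

    // Counts, for each term, how many of the matching docs contain it.
    // forwardIndex: doc ID -> terms of that doc (assumed to fit in RAM).
    // matchingDocs: doc IDs that matched the user query.
    static Map<String, Integer> countTerms(Map<Integer, List<String>> forwardIndex,
                                           int[] matchingDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : matchingDocs) {
            List<String> terms = forwardIndex.get(doc);
            if (terms == null) {
                continue; // doc has no terms in the faceted field
            }
            // Work per hit is proportional to the number of terms in the doc,
            // which is where the latency multiplier comes from.
            for (String term : terms) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> forwardIndex = new HashMap<>();
        forwardIndex.put(0, Arrays.asList("lucene", "facet"));
        forwardIndex.put(1, Arrays.asList("lucene", "search"));
        forwardIndex.put(2, Arrays.asList("index"));

        int[] matchingDocs = {0, 1}; // pretend docs 0 and 1 matched the query
        System.out.println(countTerms(forwardIndex, matchingDocs));
    }
}
```

The total work is one pass over the hits times the number of terms per hit,
so the counter map only grows with the distinct terms seen in the matching
docs, while the forward lookup itself has to cover every document.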
