lucene-java-user mailing list archives

From Christoph Boosz <christoph.bo...@googlemail.com>
Subject Re: faceted search performance
Date Mon, 12 Oct 2009 17:30:19 GMT
Thanks for your reply.
Yes, it's likely that many terms occur in few documents.

If I understand you right, I should do the following:
-Write a HitCollector that simply increments a counter
-Get the filter for the user query once:
 new CachingWrapperFilter(new QueryWrapperFilter(userQuery));
-Create a TermQuery for each term
-Perform the search and read the counter of the HitCollector
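
The steps above can be modelled in plain Java to show where the saving is supposed to come from. This is only a rough sketch and does not use the Lucene API: `CountingCollector`, `countFacets`, and the `int[]` posting lists are hypothetical stand-ins for the counting HitCollector, the filtered TermQuery search, and the documents of each term. The query result is held once in a `java.util.BitSet`, and no per-term result set is allocated.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class FacetCountSketch {

    /** Hypothetical stand-in for a counting HitCollector: it only increments a counter. */
    static final class CountingCollector {
        private int count;

        void collect(int docId) {
            count++;
        }

        int getCount() {
            return count;
        }
    }

    /**
     * For each term, counts how many of its documents are also in the query result.
     * queryDocs plays the role of the cached filter for the user query (assumption);
     * postings maps each term to the doc IDs that contain it (assumption).
     */
    static Map<String, Integer> countFacets(BitSet queryDocs, Map<String, int[]> postings) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            CountingCollector collector = new CountingCollector();
            for (int doc : entry.getValue()) {
                // Only documents that pass the query filter reach the collector.
                if (queryDocs.get(doc)) {
                    collector.collect(doc);
                }
            }
            counts.put(entry.getKey(), collector.getCount());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up example data: the user query matched docs 1, 4 and 7.
        BitSet queryDocs = new BitSet();
        queryDocs.set(1);
        queryDocs.set(4);
        queryDocs.set(7);

        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("red", new int[] {1, 2, 4});
        postings.put("blue", new int[] {3, 5});
        postings.put("green", new int[] {7});

        for (Map.Entry<String, Integer> e : countFacets(queryDocs, postings).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Note that in this model the work per term is still proportional to the number of documents containing the term, so it only saves the allocation, not the iteration.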

I did that, but it didn't get faster. Any ideas why?

Regards,
Chris

2009/10/12 John Wang <john.wang@gmail.com>

> Given you have 1M docs and about 1M terms, do you see very few docs per
> term?
> If your DocSet per term is very sparse, BitSet is probably not a good
> representation. A simple int array may be better for memory, and faster for
> iterating.
>
> -John
>
> On Mon, Oct 12, 2009 at 8:45 AM, Paul Elschot <paul.elschot@xs4all.nl> wrote:
>
> > On Monday 12 October 2009 14:53:45 Christoph Boosz wrote:
> > > Hi,
> > >
> > > I have a question related to faceted search. My index contains more than 1
> > > million documents, and nearly 1 million terms. My aim is to get a DocIdSet
> > > for each term occurring in the result of a query. I use the approach
> > > described on
> > > http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html,
> > > where a BitSet is built out of a QueryFilter for each term and intersected
> > > with the BitSet representing the user query.
> > > However, performance could be better. I guess it’s because the term filter
> > > considers each document in the index, even if it’s not in the result. My
> > > attempt to use a ChainedFilter, where the first filter (cached) is for the
> > > user query, and the second one for the term (done for all terms), didn’t
> > > speed things up, though.
> > > Am I missing something? Is there a better way to get the DocIdSets for a
> > > huge number of terms in a limited set of documents?
> >
> > Assuming you only need the number of documents within the original query
> > that contain each term, one thing that can be saved is the allocation of the
> > resulting BitSet for each term. To do this, use the cached BitSet (or the
> > OpenBitSet in current Lucene) for the original Query as a filter for a
> > TermQuery per term, and then count the matching documents by using a counting
> > HitCollector on the IndexSearcher.
> >
> > Regards,
> > Paul Elschot
> >
>
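
John Wang's point quoted above about sparse doc sets can be made concrete. A bit set over 1 million documents needs 1,000,000 bits (125,000 bytes) no matter how few bits are set, while a sorted int array holding k doc IDs needs roughly 4k bytes, so a term that occurs in 10 documents takes about 40 bytes. The sketch below is an assumption about how such arrays might be used, not code from this thread: `SparseDocSetSketch` and `intersectionCount` are hypothetical names, and both inputs are assumed to be sorted and free of duplicates.

```java
public class SparseDocSetSketch {

    /**
     * Counts the doc IDs present in both arrays with a linear merge.
     * Both arrays must be sorted ascending and contain no duplicates.
     */
    static int intersectionCount(int[] a, int[] b) {
        int i = 0;
        int j = 0;
        int count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;            // a[i] cannot match anything later in b
            } else if (a[i] > b[j]) {
                j++;            // b[j] cannot match anything later in a
            } else {
                count++;        // same doc ID in both sets
                i++;
                j++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up example data.
        int[] queryDocs = {3, 8, 15, 42, 99};
        int[] termDocs = {8, 42, 100};
        System.out.println(intersectionCount(queryDocs, termDocs));
    }
}
```

The merge touches each array element at most once, so the cost depends on the number of matching documents rather than on the size of the index.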
