lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Muir <rcm...@gmail.com>
Subject Re: Faceting, Sort and DocIDSet
Date Mon, 20 Apr 2009 14:26:11 GMT
David,

One suggestion I have for your large index. Is it possible to index these
documents ordered by Date? (and ingest new docs in Date order?)

This way index order = Date order, you can do this sort very quickly by
using Sort.INDEXORDER

with huge indexes I try to see if there's a way i can have the index sorted
in some meaningful way so I can use this trick for the most common sort
case.

hope this helps,
Robert

On Mon, Apr 20, 2009 at 10:12 AM, David Seltzer <dseltzer@tveyes.com> wrote:

> Hi Karsten,
>
> My index contains about 100M documents, and I'm trying to count results
> on around 300 facets. At the moment I'm keeping a set of cached facet
> bitsets and then comparing the query result against those bitsets.
> Performance is pretty lousy. It takes more than 2s to calculate the
> cardinality of the main query against those 300 facets.
>
> I have two possible datasets to use for the facets. One is an integer
> and the other is a short string (about 10 characters).
>
> The taxonomy solution seems interesting but it might be overkill since
> there is really no hierarchical relationship between these facets.
>
> I could count the facets manually by implementing a hitcollector, but
> the javadocs warn (pretty strenuously) about reading the content of a
> document inside a hitcollector. Is this something I should be worried
> about, or is it an inevitable part of the solution?
>
> Thanks!
>
> -Dave
>
> -----Original Message-----
> From: Karsten F. [mailto:karsten-lucene@fiz-technik.de]
> Sent: Saturday, April 18, 2009 10:58 AM
> To: java-user@lucene.apache.org
> Subject: Re: Faceting, Sort and DocIDSet
>
>
> Hi Dave,
>
> searching and sorting in lucene are two separate functions (if you not
> want
> to sort by relevance).
> You will not loss performance if you first search with BitSet as
> HitCollector and then sort the result by DateField.
> But more easy is to extend TopFieldDocCollector/TopFieldCollector to a
> Collector with facet count.
>
> Sujit Pal's implementation of facet count is a good idea if you have a
> small
> amount of facets and a lot documents for each facet.
>
> I know half a dozen of implementations of facet browsing.
> To choose the best you have to know:
>  - How many different values have the facet? Which kind of value
> (Integer,
> small String, huge String)?
>  - More then one value of the facet per document/how many in average?
>
> Possible
> http://www.nabble.com/Taxonomy-in-Lucene-td20929487.html
> is also interesting for you.
>
> Best regards
>  Karsten
>
>
> David Seltzer wrote:
> >
> > I have a set of indexes, each index contains a month's worth of
> > Articles. I need to be able to search the index (sorting by date) and
> > then apply access-filters based on the Article Source. I'm also trying
> > to get result counts for each Article Source.
> > So my questions:
> > 1) How do I use a HitCollector and sort by a field?
> > 2) Is using BitSets the wrong way to quickly generate facet counts?
> I've
> > read about DocIDSets, but I'm not sure how to use them in the same
> way.
> > (I'm basing my faceting technique on Sujit Pal's article
> >
> http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.ht
> > ml)
> >
> > Thanks!
> >
> > -Dave
> >
>
> --
> View this message in context:
> http://www.nabble.com/Faceting%2C-Sort-and-DocIDSet-tp23099854p23113784.
> html<http://www.nabble.com/Faceting%2C-Sort-and-DocIDSet-tp23099854p23113784.%0Ahtml>
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Robert Muir
rcmuir@gmail.com

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message