lucene-java-user mailing list archives

From "David Seltzer" <dselt...@TVEyes.com>
Subject RE: Faceting, Sort and DocIDSet
Date Mon, 20 Apr 2009 14:53:49 GMT
Robert,

99% of the documents are inserted as soon as we discover them, so the INDEXORDER is largely
correct. However, two factors keep me from using INDEXORDER. The first is that a small portion
of our records (1%) enter the index late (so they appear out of order with respect to the
other 99% of the index). The other factor is that we use a ParallelMultiSearcher to search
several month-long indexes. In this scenario, I'm not sure what INDEXORDER means.

Is INDEXORDER based on the DocumentID within each individual index? If so then the results
could be interleaved. Anyone know how this behaves?
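
To make the concern concrete, here is a minimal sketch in plain Java. It makes no Lucene calls and says nothing about what ParallelMultiSearcher actually does; the class, the method names, and the {subIndex, localDocId} pair layout are all made up for illustration. It only contrasts the two orderings I can imagine: sorting by each index's local document ID, versus sorting by a global ID built by offsetting each local ID by the sizes of the preceding indexes.

```java
import java.util.Arrays;

final class IndexOrderSketch {
    /**
     * Converts a (sub-index, local ID) pair into one global ID by adding the
     * sizes of all preceding sub-indexes. Hypothetical helper, not Lucene API.
     */
    static int globalId(int[] subIndexSizes, int subIndex, int localDocId) {
        int offset = 0;
        for (int i = 0; i < subIndex; i++) {
            offset += subIndexSizes[i];
        }
        return offset + localDocId;
    }

    /**
     * Sorts hits given as {subIndex, localDocId} pairs by local ID alone.
     * Hits from different sub-indexes interleave under this ordering.
     */
    static int[][] sortByLocalId(int[][] hits) {
        int[][] sorted = hits.clone();
        Arrays.sort(sorted, (a, b) -> a[1] != b[1]
                ? Integer.compare(a[1], b[1])
                : Integer.compare(a[0], b[0]));
        return sorted;
    }

    /** Sorts the same pairs by global ID, so each sub-index stays contiguous. */
    static int[][] sortByGlobalId(int[][] hits, int[] subIndexSizes) {
        int[][] sorted = hits.clone();
        Arrays.sort(sorted, (a, b) -> Integer.compare(
                globalId(subIndexSizes, a[0], a[1]),
                globalId(subIndexSizes, b[0], b[1])));
        return sorted;
    }
}
```

If the month-long indexes are passed in chronological order, only the second ordering keeps one month's hits together; under the first, hits from different months interleave.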

Thanks,

-Dave

-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Monday, April 20, 2009 10:26 AM
To: java-user@lucene.apache.org
Subject: Re: Faceting, Sort and DocIDSet

David,

One suggestion for your large index: is it possible to index these
documents ordered by Date (and ingest new docs in Date order)?

This way index order = Date order, and you can do this sort very quickly by
using Sort.INDEXORDER.

With huge indexes I try to see if there's a way I can have the index sorted
in some meaningful way so I can use this trick for the most common sort
case.

Hope this helps,
Robert

On Mon, Apr 20, 2009 at 10:12 AM, David Seltzer <dseltzer@tveyes.com> wrote:

> Hi Karsten,
>
> My index contains about 100M documents, and I'm trying to count results
> on around 300 facets. At the moment I'm keeping a set of cached facet
> bitsets and then comparing the query result against those bitsets.
> Performance is pretty lousy. It takes more than 2s to calculate the
> cardinality of the main query against those 300 facets.
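
For reference, the cached-bitset approach described above can be sketched roughly as follows. This is plain java.util.BitSet code, not the code actually running in production; the class name, method name, and the String facet keys are assumptions made for the example.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

final class BitSetFacetCounter {
    /**
     * Counts, for every facet, how many query hits also carry that facet.
     * Each cached facet bitset is copied before the AND so the cache stays intact.
     */
    static Map<String, Integer> countFacets(BitSet queryHits, Map<String, BitSet> facetBits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> facet : facetBits.entrySet()) {
            BitSet overlap = (BitSet) facet.getValue().clone();
            overlap.and(queryHits);
            counts.put(facet.getKey(), overlap.cardinality());
        }
        return counts;
    }
}
```

With 100M documents, each bitset spans about 1.56 million 64-bit words, so 300 intersections touch roughly 470 million words per query no matter how few documents match. That fixed cost is one plausible reason the counts are slow.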
>
> I have two possible datasets to use for the facets. One is an integer
> and the other is a short string (about 10 characters).
>
> The taxonomy solution seems interesting but it might be overkill since
> there is really no hierarchical relationship between these facets.
>
> I could count the facets manually by implementing a hitcollector, but
> the javadocs warn (pretty strenuously) about reading the content of a
> document inside a hitcollector. Is this something I should be worried
> about, or is it an inevitable part of the solution?
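
One way to count inside a collector without loading any document is to look the facet up in an array that was filled once, ahead of time. The sketch below is plain Java under that assumption: the class is a stand-in that only mirrors the shape of HitCollector.collect(int, float) rather than extending it, and the facetOfDoc array is hypothetical. In Lucene such an array might come from something like FieldCache.DEFAULT.getInts(reader, field) for the integer facet, but that is a guess about the setup, not something tested here.

```java
final class FacetCountingCollector {
    private final int[] facetOfDoc; // facet ordinal per document ID, loaded once up front
    private final int[] counts;     // hit count per facet ordinal

    FacetCountingCollector(int[] facetOfDoc, int facetCount) {
        this.facetOfDoc = facetOfDoc;
        this.counts = new int[facetCount];
    }

    /** Called once per matching document; does one array lookup and one increment. */
    void collect(int doc, float score) {
        counts[facetOfDoc[doc]]++;
    }

    /** Returns a copy of the per-facet hit counts. */
    int[] counts() {
        return counts.clone();
    }
}
```

The work per query is proportional to the number of hits rather than to index size times facet count. The price is memory: one int per document is about 400 MB for 100M documents, and since 300 facet values fit in a short, a short[] would halve that.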
>
> Thanks!
>
> -Dave
>
> -----Original Message-----
> From: Karsten F. [mailto:karsten-lucene@fiz-technik.de]
> Sent: Saturday, April 18, 2009 10:58 AM
> To: java-user@lucene.apache.org
> Subject: Re: Faceting, Sort and DocIDSet
>
>
> Hi Dave,
>
> Searching and sorting in Lucene are two separate functions (if you do not
> want to sort by relevance).
> You will not lose performance if you first search with a BitSet as
> HitCollector and then sort the result by DateField.
> But it is easier to extend TopFieldDocCollector/TopFieldCollector into a
> Collector with facet counts.
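
The idea of one collector that both keeps the top hits by date and counts facets can be sketched in plain Java as below. This does not extend TopFieldDocCollector or TopFieldCollector; it is a stand-in with a bounded heap, and the class name and the dateOfDoc and facetOfDoc arrays (one entry per document ID, loaded beforehand) are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class TopDateWithFacets {
    private final long[] dateOfDoc;  // date value per document ID
    private final int[] facetOfDoc;  // facet ordinal per document ID
    private final int[] facetCounts; // hit count per facet ordinal
    private final int topN;
    // Min-heap on date: the head is the oldest kept hit, so it is evicted first.
    private final PriorityQueue<Integer> newest;

    TopDateWithFacets(long[] dateOfDoc, int[] facetOfDoc, int facetCount, int topN) {
        this.dateOfDoc = dateOfDoc;
        this.facetOfDoc = facetOfDoc;
        this.facetCounts = new int[facetCount];
        this.topN = topN;
        this.newest = new PriorityQueue<>(Math.max(1, topN),
                (a, b) -> Long.compare(dateOfDoc[a], dateOfDoc[b]));
    }

    /** Counts the hit's facet, then keeps the hit only if it is among the topN newest. */
    void collect(int doc) {
        facetCounts[facetOfDoc[doc]]++;
        if (topN <= 0) {
            return;
        }
        if (newest.size() < topN) {
            newest.add(doc);
        } else if (dateOfDoc[doc] > dateOfDoc[newest.peek()]) {
            newest.poll();
            newest.add(doc);
        }
    }

    /** Returns the kept document IDs, newest date first. */
    List<Integer> topDocsNewestFirst() {
        List<Integer> docs = new ArrayList<>(newest);
        docs.sort((a, b) -> Long.compare(dateOfDoc[b], dateOfDoc[a]));
        return docs;
    }

    /** Returns a copy of the per-facet hit counts over all collected hits. */
    int[] facetCounts() {
        return facetCounts.clone();
    }
}
```

Facet counts cover every hit, while the heap never holds more than topN entries, so sorting and counting share a single pass over the results.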
>
> Sujit Pal's implementation of facet counting is a good idea if you have a
> small number of facets and a lot of documents for each facet.
>
> I know half a dozen implementations of facet browsing.
> To choose the best one you have to know:
>  - How many different values does the facet have? Which kind of value
>    (Integer, small String, huge String)?
>  - Is there more than one value of the facet per document? How many on
>    average?
>
> Possibly
> http://www.nabble.com/Taxonomy-in-Lucene-td20929487.html
> is also of interest to you.
>
> Best regards
>  Karsten
>
>
> David Seltzer wrote:
> >
> > I have a set of indexes, each index contains a month's worth of
> > Articles. I need to be able to search the index (sorting by date) and
> > then apply access-filters based on the Article Source. I'm also trying
> > to get result counts for each Article Source.
> > So my questions:
> > 1) How do I use a HitCollector and sort by a field?
> > 2) Is using BitSets the wrong way to quickly generate facet counts? I've
> > read about DocIDSets, but I'm not sure how to use them in the same way.
> > (I'm basing my faceting technique on Sujit Pal's article
> > http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html)
> >
> > Thanks!
> >
> > -Dave
> >
>
> --
> View this message in context:
> http://www.nabble.com/Faceting%2C-Sort-and-DocIDSet-tp23099854p23113784.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


-- 
Robert Muir
rcmuir@gmail.com