lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: How to retrieve distinct field matches?
Date Fri, 16 Dec 2005 08:55:36 GMT
This is pretty much the same problem that many of us have faced when  
it comes to faceted browsing.  I'm using a set of cached BitSet's  
that represent the documents that have a specific category (or  
general "facet" in my case).  I do a full-text search for "some query  
expression", using QueryFilter to get the BitSet for the query.  Then  
I AND the Hits BitSet with each of the facet BitSet's, and the  
cardinality of each gives me the number in each category that matches  
the query.  I load up these BitSet's when my search server is  
launched.  In my case I'm currently dealing with about 30k documents,  
with maybe 100 unique facet values, and these load in the blink of an  

I realize the above description was void of code specifics, but the  
gist is there.  Hope it helps.


On Dec 15, 2005, at 8:16 PM, Mr Plate wrote:

> This puzzle has been bugging me for a while; I'm hoping there's an  
> elegant way to handle it in Lucene.
> I've got an index of over 100,000 Documents. In addition to other  
> fields, each of these Documents has 0 or more "category" field  
> values. There are over 5,500 such categories (it's not a small  
> set). Anywhere from 1 to 500+ Documents could belong to a single  
> "category". This index does not get updated very often; anywhere  
> from once a day to once a month. Indexing time is currently 15-30  
> minutes from start to finish/optimization.
> I'd like to provide users a way to search these "category" values.  
> For example, suppose the user searches for "fiction". They might  
> see results of:  { "fiction", "non-fiction" }. However, I'd like to  
> do this search as quickly and efficiently as reasonable. For  
> example, if there are 500 Documents of category "fiction", and 400  
> of "non-fiction", I don't want to Sort and iterate through each Hit  
> to weed out the duplicate values from my query.
> For what it's worth, I imagine only 0-20 categories would match a  
> given query.
> The best I can imagine is to maintain a separate Lucene index for  
> each of these category types. Each Document in this separate index  
> would probably have fields of "field_name", and "field_value", and  
> would not contain any duplicates. For example, you might see a  
> Document of field_name "category" and field_value "non-fiction". My  
> query would hit this second index instead, to perform these  
> metadata searches.
> I hope that makes sense; do you know of a more elegant way to  
> handle this type of problem?
> Thanks,
> Tyler
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message