lucene-java-user mailing list archives

From Mr Plate
Subject How to retrieve distinct field matches?
Date Fri, 16 Dec 2005 01:16:45 GMT
This puzzle has been bugging me for a while; I'm hoping there's an  
elegant way to handle it in Lucene.


I've got an index of over 100,000 Documents. In addition to other  
fields, each of these Documents has 0 or more "category" field  
values. There are over 5,500 such categories (it's not a small set).  
Anywhere from 1 to 500+ Documents could belong to a single  
"category". This index does not get updated very often: anywhere from
once a day to once a month. A full index build currently takes 15-30
minutes from start to finish, including optimization.


I'd like to provide users a way to search these "category" values.
For example, suppose the user searches for "fiction". They might see
results of: { "fiction", "non-fiction" }. However, I'd like this search
to be as quick and efficient as is reasonable. For example, if there
are 500 Documents of category "fiction" and 400 of "non-fiction", I
don't want to Sort and iterate through every Hit to weed out the
duplicate values from my query.

For what it's worth, I imagine only 0-20 categories would match a  
given query.


The best I can imagine is to maintain a separate Lucene index for
each of these category types. Each Document in this separate index
would probably have the fields "field_name" and "field_value", and the
index would not contain any duplicates. For example, you might see a
Document of field_name "category" and field_value "non-fiction". My
query would hit this second index instead, to perform these metadata
searches.
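
To make the idea concrete, here is a minimal sketch of the "one entry per
distinct value" structure. It is plain Java, not Lucene API calls: the class
DistinctValueIndex and its add/search methods are hypothetical names made up
for illustration, and the case-insensitive substring match is an assumption
standing in for whatever analysis the real second index would use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/**
 * Hypothetical stand-in for the second "metadata" index: it keeps one
 * entry per distinct (field_name, field_value) pair, so a search never
 * has to weed duplicates out of the main index's hits.
 */
final class DistinctValueIndex {
    // field_name -> sorted set of distinct field_value strings
    private final Map<String, TreeSet<String>> valuesByField = new HashMap<>();

    /** Record a value seen while building the main index; duplicates collapse. */
    public void add(String fieldName, String fieldValue) {
        valuesByField.computeIfAbsent(fieldName, k -> new TreeSet<>()).add(fieldValue);
    }

    /** Distinct values of fieldName that contain term (case-insensitive), in sorted order. */
    public List<String> search(String fieldName, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String value : valuesByField.getOrDefault(fieldName, new TreeSet<>())) {
            if (value.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(value);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        DistinctValueIndex index = new DistinctValueIndex();
        // Simulate 500 "fiction" and 400 "non-fiction" Documents being indexed.
        for (int i = 0; i < 500; i++) index.add("category", "fiction");
        for (int i = 0; i < 400; i++) index.add("category", "non-fiction");
        index.add("category", "poetry");
        System.out.println(index.search("category", "fiction")); // prints [fiction, non-fiction]
    }
}
```

The cost of a search here depends on the number of distinct values (about
5,500 categories in my case), not on the 100,000+ Documents in the main index.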

I hope that makes sense; do you know of a more elegant way to handle  
this type of problem?


