lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael D. Curtin" <>
Subject Re: How to retrieve distinct field matches?
Date Fri, 16 Dec 2005 02:03:06 GMT
Mr Plate wrote:

> This puzzle has been bugging me for a while; I'm hoping there's an  
> elegant way to handle it in Lucene.
> I've got an index of over 100,000 Documents. In addition to other  
> fields, each of these Documents has 0 or more "category" field  values. 
> There are over 5,500 such categories (it's not a small set).  Anywhere 
> from 1 to 500+ Documents could belong to a single  "category". This 
> index does not get updated very often; anywhere from  once a day to once 
> a month. Indexing time is currently 15-30 minutes  from start to 
> finish/optimization.
> I'd like to provide users a way to search these "category" values.  For 
> example, suppose the user searches for "fiction". They might see  
> results of:  { "fiction", "non-fiction" }. However, I'd like to do  this 
> search as quickly and efficiently as reasonable. For example, if  there 
> are 500 Documents of category "fiction", and 400 of "non- fiction", I 
> don't want to Sort and iterate through each Hit to weed  out the 
> duplicate values from my query.
> For what it's worth, I imagine only 0-20 categories would match a  given 
> query.
> The best I can imagine is to maintain a separate Lucene index for  each 
> of these category types. Each Document in this separate index  would 
> probably have fields of "field_name", and "field_value", and  would not 
> contain any duplicates. For example, you might see a  Document of 
> field_name "category" and field_value "non-fiction". My  query would hit 
> this second index instead, to perform these metadata  searches.
> I hope that makes sense; do you know of a more elegant way to handle  
> this type of problem?

I'm guessing that each Document doesn't have a "category" field with 
multiple values in it but, instead, has a uniquely-named field for each 
category.  Would it work to change your data model to the former?  That 
is, have a Text field named "category" in each document, so that it gets 
tokenized and indexed.  Then you could do a search of the 5K category 
names (outside of Lucene, perhaps by getting the list of Terms from the 
"category" field) for the query term of interest, "fiction" in your 
example, then compose a Lucene query with the results.  Your example 
would produce a query equivalent to 'category:fiction 
category:non-fiction'.  For only 100K documents, this should be pretty fast.

Good luck!


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message