lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <>
Subject Re: Terms given a filter?
Date Thu, 15 Sep 2005 13:19:34 GMT

On Sep 15, 2005, at 5:00 AM, JMA wrote:
> I know I can get all the fields in an index: reader.getFieldNames()
> and also all the terms:  reader.terms()
> However, I need to be able to get all the terms and fields given a  
> search
> filter. For example, say I have an index that has crawled 5000 pdf  
> files
> (books) and I have the following fields:
> content, author (not tokenized), and publish_date
> I can easily find all the *distinct* authors in the index using
> 'reader.terms()'.  But say I want to list all the *distinct*  
> authors that
> have published books in 2002?  I can do a simple search to get all  
> the books
> filtered by publish_date:2002.  But then I have to do my own scan  
> of the
> results and pull out the author, removing duplicates.
> Is there an easier way to do this?

I'm currently building a faceted navigation system (think Google for  
Nineteenth century literature, except with browsing navigation by  
author, date range, genre, and probably some others as it evolves.   
This is very much like the CNET implementation that Chris detailed  

My index is pretty static after it is built, so I cache a lot.  The  
first thing I do is walk all the unique terms (using reader.terms())  
for the faceted fields, and for each one I create a BitSet that has  
set bits corresponding to each document that has that term.  I allow  
the user to build up constraints while navigating with any number of  
these facets, and simply AND the BitSets together to find the  
matching documents.  I also allow for full-text search to occur  
within those constraints, and leverage QueryFilter.bits() in that  
case.  The BitSet's allow me to display how many documents, based on  
the constraints, are in each of the "buckets".

So more to your question - using the scheme I just described, you  
could build up a BitSet for each of the authors.  Then a BitSet for  
2002 (this could be a simple QueryFilter with a TermQuery 
("publish_date", "2002") for example).  AND the BitSet of 2002 to all  
of the author BitSets, and any BitSet with a cardinality > 0 has  
documents for that author.

Make sense?


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message