lucene-java-user mailing list archives

From Simeon Koptelov <skopte...@fis.ru>
Subject Re: Document numbers and ids
Date Mon, 07 Feb 2005 09:25:12 GMT
On Sunday 06 February 2005 20:00, Chris Hostetter wrote:
> : > care about their content. I only want to know a particular numeric
> : > field from
> : > document (id of document's category).
> : > I also need to know how many docs in category were found, so I can't
> : > index
> :
> : You should explore the use of IndexReader.  Index your documents with
> : category id field, and use the methods on IndexReader to find all
> : unique categories (TermEnum).
>
> to expand on erik's suggestion: once you know the complete list of
> categories you iterate over them and execute your search once per
> category, filtering each time on the category Id (to determine the number
> of results from that category).

Nah, I did something a little trickier that promises to be faster (I have 12K
categories now and there will be more).
I index each doc's category id as a zero-padded keyword. Then I search for
documents, sorting them by category id. Then I iterate over Hits using the
following scheme:
1. I keep a cache that holds the ids of the documents in the current category.
2. Each time I see a doc id that is not in the current category, I read that
document and reload the cache with its category data.

So if I find docs in N categories (N is usually not big), I need to read
exactly N docs from disk; the rest of the iteration through Hits is just
cache checks (because I sort by category).
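
A rough sketch of that loop in plain Java, in case it helps. It does not use
the Lucene API at all: CategorySource, InMemorySource, padKey and
countByCategory are made-up names for illustration, and the in-memory map only
stands in for the index. The "reads" counter simulates the expensive document
loads, to show that hits sorted by category cost one load per category.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class CategoryCount {

    /** Hypothetical stand-in for the index; not a Lucene interface. */
    public interface CategorySource {
        String categoryOf(int docId);            // simulates reading a stored doc from disk
        Set<Integer> docsInCategory(String key); // ids of all docs in one category
    }

    /** In-memory source that counts how many "disk reads" were made. */
    public static final class InMemorySource implements CategorySource {
        private final Map<Integer, String> docToCategory;
        public int reads = 0;

        public InMemorySource(Map<Integer, String> docToCategory) {
            this.docToCategory = docToCategory;
        }

        public String categoryOf(int docId) {
            reads++;
            return docToCategory.get(docId);
        }

        public Set<Integer> docsInCategory(String key) {
            Set<Integer> ids = new HashSet<>();
            for (Map.Entry<Integer, String> e : docToCategory.entrySet()) {
                if (e.getValue().equals(key)) {
                    ids.add(e.getKey());
                }
            }
            return ids;
        }
    }

    /** Zero-pad the id so that string order matches numeric order. */
    public static String padKey(int categoryId) {
        return String.format("%010d", categoryId);
    }

    /**
     * Counts hits per category. Assumes hitsSortedByCategory is already
     * ordered by category key, so each category forms one contiguous run.
     */
    public static Map<String, Integer> countByCategory(int[] hitsSortedByCategory,
                                                       CategorySource source) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Set<Integer> cache = Collections.emptySet(); // doc ids of the current category
        String current = null;
        for (int docId : hitsSortedByCategory) {
            if (!cache.contains(docId)) {
                // Cache miss: a new category starts here, so read this one doc.
                current = source.categoryOf(docId);
                cache = source.docsInCategory(current);
            }
            counts.merge(current, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new HashMap<>();
        docs.put(4, padKey(7));
        docs.put(1, padKey(7));
        docs.put(9, padKey(12));
        InMemorySource source = new InMemorySource(docs);
        System.out.println(countByCategory(new int[] {4, 1, 9}, source));
        System.out.println("reads: " + source.reads);
    }
}
```

The padding matters because keywords sort as strings: unpadded, "12" would
sort before "7", and the runs per category would still be contiguous but in a
different order than the numeric one.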

It's a pity Lucene doesn't have IndexSearcher.search( Query, Sort,
HitCollector ), but if I understood Hits properly, it gives me an
O( log2( doc_num ) ) performance impact per result set, which is perfectly
acceptable.

