lucene-dev mailing list archives

From Doug Cutting <cutt...@lucene.com>
Subject Re: Caching filter wrapper (was Re: RE : DateFilter.Before/After)
Date Tue, 16 Sep 2003 16:26:37 GMT
Bruce Ritchie wrote:
> The times shown above is only the time taken to call the following code 
> (numResults is a max of 1500 or hits.length(), whichever is smaller):
> 
> for (int i = 0; i < numResults; i++) {
>    ids[i] = Long.parseLong((hits.doc(i)).get("messageID"));
> }

This is not a recommended way to use Lucene.  The intent is that you 
should only have to call Hits.doc() for documents that you actually 
display, usually around 10 per query.  Is this still a bottleneck when 
you fetch a max of 10 or 20 documents?

So I'd be interested to hear why you need 1500 hits.  My guess is that 
you're doing post-processing of hits, then selecting 10 or so to 
actually display.  If you can figure out a way to do this post 
processing without accessing the document object, i.e., through the 
query, a custom HitCollector, or the SearchBean, then this optimization 
is probably not needed.
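
To illustrate the custom-collector route, here is a minimal sketch in plain Java rather than against the real Lucene classes. It assumes a messageID table indexed by document number has been built once up front, so the per-query loop never loads a document object. HitCallback is a hypothetical stand-in that only mirrors the (doc, score) shape of a HitCollector callback; TopIdsCollector and idByDoc are likewise illustrative names. Because a collector sees hits in arbitrary order rather than ranked, the sketch keeps the best-scoring entries in a small bounded heap.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a HitCollector-style callback: (doc number, score). */
interface HitCallback {
    void collect(int doc, float score);
}

/** Keeps the messageIDs of the top-scoring hits without loading any document. */
final class TopIdsCollector implements HitCallback {
    private static final class Entry {
        final long id;
        final float score;
        Entry(long id, float score) { this.id = id; this.score = score; }
    }

    private final long[] idByDoc;   // assumed cache: doc number -> messageID
    private final int max;          // e.g. 1500 in the code quoted above
    // Min-heap on score, so the weakest retained hit is always at the head.
    private final PriorityQueue<Entry> heap =
        new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    TopIdsCollector(long[] idByDoc, int max) {
        this.idByDoc = idByDoc;
        this.max = max;
    }

    @Override
    public void collect(int doc, float score) {
        if (heap.size() < max) {
            heap.add(new Entry(idByDoc[doc], score));
        } else if (score > heap.peek().score) {
            heap.poll();                              // drop the weakest hit
            heap.add(new Entry(idByDoc[doc], score));
        }
    }

    /** Returns the retained messageIDs, best score first. */
    long[] topIds() {
        Entry[] entries = heap.toArray(new Entry[0]);
        Arrays.sort(entries, (a, b) -> Float.compare(b.score, a.score));
        long[] out = new long[entries.length];
        for (int i = 0; i < entries.length; i++) {
            out[i] = entries[i].id;
        }
        return out;
    }
}
```

The cost of building the table is paid once per index reader rather than once per hit, which is the same trade SearchBean makes: more memory in exchange for not touching stored documents during the query.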

A 30% optimization to a slow algorithm is better than nothing, but it 
would be better yet to improve the algorithm.  That said, this sort of 
improvement is not always trivial, and lots of people use Lucene in the 
way that you have, so it may still be worth optimizing this.

If your post-processing is done in order to sort the results, then I 
recommend trying the SearchBean, in the Lucene sandbox.  I've never used 
it myself, but it is able to provide results sorted by any field without 
accessing the document object of each hit while the query is processed 
(it caches tables of field values when constructed).  Examining the 
SearchBean code, I see an optimization: it would be more efficient if it 
used a HitCollector rather than a Hits when sorting, as the Hits may 
have to re-query a few times to get the full set of results, but even 
with that, I suspect you'd see a speedup.
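
The cached-table idea can be sketched as follows. This is not SearchBean's actual code, which I have not reproduced here; it is plain Java under the assumption that a table of field values, indexed by document number, was filled once at construction time. FieldSortSketch, sortByField, and valueByDoc are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

final class FieldSortSketch {
    /**
     * Orders the matching doc numbers by their cached field value.
     * valueByDoc is the assumed cache: doc number -> field value.
     * No document object is loaded while sorting.
     */
    static int[] sortByField(int[] matchingDocs, String[] valueByDoc) {
        // Box the doc numbers so a comparator can be used.
        Integer[] boxed = new Integer[matchingDocs.length];
        for (int i = 0; i < matchingDocs.length; i++) {
            boxed[i] = matchingDocs[i];
        }
        // Arrays.sort on objects is stable, so ties keep their input order.
        Arrays.sort(boxed, Comparator.comparing((Integer d) -> valueByDoc[d]));

        int[] out = new int[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
            out[i] = boxed[i];
        }
        return out;
    }
}
```

For example, with cached values {"pear", "apple", "fig", "kiwi"} and matching docs {0, 2, 3}, the call compares only table entries and returns the docs ordered fig, kiwi, pear.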

I wonder if SearchBean, or something like it, should be added to the 
core?  This is something lots of folks ask for.  SearchBean's technique 
can use a fair amount of memory, but most folks are not short on RAM 
these days.  One could optimize SearchBean's sorting for integer-valued 
fields, but that could also be done after it is added to the core.

What do folks think about adding SearchBean to the core?  Perhaps it 
could be merged with the existing Hits code, as a primary API for 
accessing search results?

Doug

