lucene-dev mailing list archives

From Bruce Ritchie <br...@jivesoftware.com>
Subject Re: Caching filter wrapper (was Re: RE : DateFilter.Before/After)
Date Tue, 16 Sep 2003 17:54:18 GMT
Doug Cutting wrote:
>> for (int i = 0; i < numResults; i++) {
>>    ids[i] = Long.parseLong((hits.doc(i)).get("messageID"));
>> }
> 
> This is not a recommended way to use Lucene.  The intent is that you 
> should only have to call Hits.doc() for documents that you actually 
> display, usually around 10 per query.  Is this still a bottleneck when 
> you fetch a max of 10 or 20 documents?

I didn't test this case.

> So I'd be interested to hear why you need 1500 hits.  My guess is that 
> you're doing post-processing of hits, then selecting 10 or so to 
> actually display.  If you can figure out a way to do this post 
> processing without accessing the document object, i.e., through the 
> query, a custom HitCollector, or the SearchBean, then this optimization 
> is probably not needed.

We would dearly love not to have to post-process results returned from
Lucene. Unfortunately, we can't foresee a way to do this given the current
architecture of our applications and Lucene. The issue is that we must both
exclude search results based upon a permission system that is external to
Lucene and be able to sort results based upon criteria that, again, can't be
stored inside Lucene (document rating is an example). Neither the
permissions nor the external sort criteria can be stored in Lucene, because
they either impact too many documents when they change (one permission
change could require 'updating' a field in every document in the Lucene
store) or change too often (it's quite probable that a document rating will
change every time a document is viewed, for example).

The only way I foresee that we could internalize both of these factors into
Lucene is if it were possible to modify a document inside of Lucene at
basically no cost. Since that's not currently possible, we are stuck with
retrieving all the documents from Lucene and post-processing them. Even if
updating a document were possible, we might decide that it's just not worth
storing some document attributes in Lucene from an overall performance
perspective. There may of course be other possible solutions; however, we
haven't yet thought of them.
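
To make the post-processing described above concrete, here is a minimal
sketch, not our actual code. It assumes the message IDs have already been
pulled out of the hits (as in the loop quoted at the top). PermissionChecker,
RatingSource and postProcess are hypothetical names standing in for the
external permission and rating systems; nothing here is Lucene API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PostProcessSketch {

    /** Hypothetical stand-in for the external permission system. */
    public interface PermissionChecker {
        boolean canRead(long userId, long messageId);
    }

    /** Hypothetical stand-in for the external rating store. */
    public interface RatingSource {
        double ratingOf(long messageId);
    }

    /**
     * Drops the IDs the user may not see, orders the rest by external
     * rating (highest first), and returns at most pageSize of them.
     */
    public static List<Long> postProcess(long[] ids, long userId,
                                         PermissionChecker permissions,
                                         RatingSource ratings,
                                         int pageSize) {
        List<Long> visible = new ArrayList<>();
        for (long id : ids) {
            if (permissions.canRead(userId, id)) {
                visible.add(id);
            }
        }
        // List.sort is stable, so equally rated messages keep the order
        // in which the search returned them.
        visible.sort(Comparator
                .comparingDouble((Long id) -> ratings.ratingOf(id))
                .reversed());
        int end = Math.min(pageSize, visible.size());
        return new ArrayList<>(visible.subList(0, end));
    }

    public static void main(String[] args) {
        long[] ids = {1L, 2L, 3L, 4L};
        Map<Long, Double> ratingTable = new HashMap<>();
        ratingTable.put(1L, 3.0);
        ratingTable.put(2L, 5.0);
        ratingTable.put(3L, 4.5);
        ratingTable.put(4L, 1.0);

        // Assumed rule for the demo: message 2 is hidden from this user.
        List<Long> page = postProcess(ids, 7L,
                (user, message) -> message != 2L,
                message -> ratingTable.get(message),
                2);
        System.out.println(page); // prints [3, 1]
    }
}
```

Every candidate ID has to pass through the permission check and the sort
before the first page can be chosen, which is why all of the hits are
needed and not just the 10 or so that end up being displayed.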

> A 30% optimization to a slow algorithm is better than nothing, but it 
> would be better yet to improve the algorithm.  That said, this sort of 
> improvement is not always trivial, and lots of people use Lucene in the 
> way that you have, so it may still be worth optimizing this.

30% on my machine. I think it's likely to be quite a bit faster when the
Lucene files are striped across multiple disks. I can't test that
assumption, though, as I don't have the hardware available. I believe the
speedup is beneficial in almost all situations and the cost associated with
the optimization is quite minimal, especially when compared to the
alternatives (slow searches under heavy load, or more memory usage and file
descriptors through multiple readers).


Regards,

Bruce Ritchie
