lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ken Krugler <>
Subject Re: BufferedIndexInput.readByte performance
Date Fri, 26 May 2006 17:13:18 GMT
>On Friday 26 May 2006 16:14, Michael Chan wrote:
>>  Hi,
>  > I have a 5gb index containing 2mil documents and am trying to run
>>  1mil+ queries against it. Most of the queries are SpanQueries and it
>>  occurs to me that the search performance is quite slow when using 2, 3
>>  SpanOrQueries nested inside a SpanNearQuery, which in turn is nested
>>  inside another SpanNearQuery. The response time is around 3-5 seconds
>>  even when the index is stored as a RAMDirectory, and
>>  BufferedIndexInput.readByte() appears to be the bottleneck. Is this
>  > performance typical? As I don't need any sorting of the results and
>That is indeed typical.
>  > only need the number of results returned, is there anything, besides
>>  Field.setOmitNorm(true), I can modify to improve performance?
>A few things might help:
>- use getSpans() on the scorer of the query, iterate the resulting Spans
>   and count the number of different doc values.
>   This saves the scoring and the sorting on score value.
>- Sort the queries alphabetically, to try and maximize cache usage.
>- Increase the skip interval when creating the index, by default lucene uses
>   16, but nutch uses a higher value. I've never done this myself, but
>   you could specifically ask on how to do this.

I'm curious how increasing the skip interval would improve 
performance. From what I've read on-list and in-code, having a 
smaller skip interval means trading more memory for faster term 
lookup. I think Nutch uses a larger default value (128) to better 
handle big (e.g. 10M) indexes, at the expense of slightly slower 

Also, the use of a sorted index would seem to offer the biggest 
potential speed win, though that would require a good way of ordering 
the index, and it seems like there would be lots of tuning required 
to pick the right cut-off value for searches.


-- Ken
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message