lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <>
Subject Re: Lucene optimization with one large index and numerous small indexes.
Date Tue, 30 Mar 2004 17:03:18 GMT
Esmond Pitt wrote:
> Don't want to start a buffer size war, but these have always seemed too
> small to me. I'd recommend upping both InputStream and OutputStream buffer
> sizes to at least 4k, as this is the cluster size on most disks these days,
> and also a common VM page size.


> Reading and writing in smaller quantities
> than these is definitely suboptimal.

This is not obvious to me.  Can you provide Lucene benchmarks which show 
this?  Modern filesystems have extensive caches, perform read-ahead and 
delay writes.  Thus file-based system calls do not have a close 
correspondence to physical operations.

To my thinking, the primary role of file buffering in Lucene is to 
minimize the overhead of the system call itself, not to minimize 
physical i/o operations.  Once the overhead of the system call is made 
insignificant, larger buffers offer little measurable improvement.

Also, we cannot increase the size of these blindly.  Buffers are the 
largest source of per-query memory allocation in Lucene, with one (or 
two for phrases and spans) allocated for every query term.  Folks whose 
applications perform wildcard queries have encountered out-of-memory 
exceptions with the current buffer size.

Possibly one could implement a term wildcard mechanism which does not 
require a buffer per term, or perhaps one could allocate small buffers 
for infrequent terms (the vast majority).  If such changes were made 
then it might be feasable to bump up the buffer size somewhat.  But, 
back to my first point, one must first show that larger buffers offer 
significant performance improvements.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message