lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <>
Subject Re: Performance and FS block size
Date Sun, 12 Feb 2006 22:43:58 GMT
On Sunday 12 February 2006 22:48, John Haxby wrote:
> Otis Gospodnetic wrote:
> >I'm somewhat familiar with ext3 vs. ReiserFS stuff, but that's not really 
what I'm after (finding a better/faster FS).  What I'm wondering is about 
different block sizes on a single (ext3) FS.
> >If I understand block sizes correctly, they represent a chunk of data that 
the FS will read in a single read.
> >- If the block size is 1K, and Lucene needs to read 4K of data, then the 
disk will have to do 4 reads, and will read in a total of 4K.
> >- If the block size is 4K, and Lucene needs to read 3K of data, then the 
disk will have to do 1 read, and will read a total of 3K, although that will 
actually consume 4K, because that's the size of a block.
> >  
> >
> That's correct Otis.   Applications generally to get best performance 
> when they read data in the file system block size (or small multiples 
> thereof) which for ext2 and ext3 is almost always 4k.  It might be 
> interesting to try making file systems with different block sizes and 
> see what the effect on performance is and also, perhaps trying larger 
> block sizes in Lucene, but always keeping Lucene's block size a multiple 
> of the file system block size.   For an educated guess, I'd say that 
> 4k/4k gives better performance than smaller file system block sizes and 
> 8k/4k is not likely to have much of an effect either way.
> >Does any of this sound right?
> >I recall Paul Elschot talking about disk reads and disk arm movement, and 
Robert Engels talking about Nio and block sizes, so they might know more 
about this stuff.
> >  
> >
> It depends very much on the type of disk: 15,000 rpm ultra-scsi 320 
> disks on a 64 bit PCI card will probably be faster than a 4200rpm disk 
> in a laptop :-)   Seriously, disk configuration makes a lot of 
> difference: striped RAID arrays will give the best I/O performance 
> (given a  controller and whatnot that can exploit that).   Once you get 
> into huge amount of I/O there are other, more complex issues that affect 
> performance.
> java.nio has the right features to exploit the I/O subsystem of the OS 
> to good advantage.   We haven't done the performance measurements yet, 
> but memory mappied I/O should yield the best performance (as well as 
> freeing you from worrying about what block size is best).    It will 
> also be interesting to try the different I/O schedulers under Linux: cfq 
> is the default for the 2.6 kernel that Red Hat ships, but I can imagine 
> the deadline scheduler may give interesting results.   As I say, at some 
> stage over the next few months we're likely to be looking at this in 
> more detail.
> The one thing that makes more difference than anything else though is 
> locality of reference; this seems to well understood by the Lucene index 
> format and is probably why the performance is generall good!

IndexReader.doc(docId) for more than 2 docs is normally best done with
increasing docId. This reduces disk head movement, since the stored docs
are in that order.
When Hits() its used, it is tempting to retrieve docs in scoring order via 
the Hits.doc() method, but that is probably not the best order for retrieval

Paul Elschot

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message