lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mattspitz <>
Subject Appropriate disk optimization for large index?
Date Sat, 16 Aug 2008 08:07:52 GMT

Hi!  I'm using Lucene 2.3.2 to store a relatively-large index of HTML
documents.  I'm storing ~150 million documents, taking up 150 GB of space.

I index the HTML text, but I only store primary key information that allows
me to retrieve it later.  Thus, my document size is small, but obviously, I
need to store the index as well, and I imagine that's what takes up almost
all of the space.

Since I allow users to search this HTML specific to their user index, I
create multiple indexes (~2000) such that a given user only has to search
one of the 2000 indexes to get to their specific document.  I also have
queries that span all 2000 indexes.

So, I have 2000 indexes full of small documents but relatively large
indexing space.

My question is what sort of disk to buy.  Using "dstat", I've determined
that the disk is clearly the bottleneck.  Nearly all the time I spend
indexing "chunks" of documents and committing them to disk is spent waiting
on I/O operations.  I spawn multiple threads to access the various index
writers so as to minimize I/O wait time, but disk always ends up being the

Currently, I've got 7200rpm SATA drives (RAID 0), but I've also got 15k SAS
drives (RAID 0 as well) on hand.

My question is, what's the access pattern of Lucene when it comes to
indexing documents, merging segments, and eventually optimizing them (given
what I've mentioned about document count and document size)?

Am I better off with a drive that has a faster seek time, or do I need to
optimize for sustained throughput?  How does the way in which Lucene lays
indexes on disk affect this?

If it helps, my merge factor is 50, and given that I run out of file
descriptors otherwise, I use the compound file format.

Thanks for your help,
View this message in context:
Sent from the Lucene - Java Users mailing list archive at

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message