lucene-dev mailing list archives

From Bernhard Messer <Bernhard.Mes...@intrafind.de>
Subject optimized disk usage when creating a compound index
Date Fri, 06 Aug 2004 07:52:57 GMT
hi developers,

I made some measurements of Lucene's disk usage during index creation. 
It's no surprise that index creation, and index optimization in 
particular, needs more disk space than the final index will occupy. 
What I didn't expect is how much disk usage differs depending on whether 
the compound file option is switched on or off. With the compound file 
option on, peak disk usage during index creation is more than 3 times 
the final index size. This could be a pain in the neck for projects like 
Nutch, where huge datasets are indexed. The growth comes from the fact 
that SegmentMerger creates the complete compound file first, before 
deleting the original, now unused files.
So I patched the SegmentMerger and CompoundFileWriter classes so that 
they delete each file immediately after its data has been copied into 
the compound file. As a result, the required disk space dropped from 
about 3 times to about 2 times the final index size.
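
The delete-after-copy idea can be sketched in isolation as follows. This is not the patch itself and does not use Lucene's API: the class name CompoundCopySketch, the method packAndDelete and the file names are hypothetical, and the sketch writes no directory or offset table, which a real compound file needs. It only illustrates, under those assumptions, why deleting each source file right after it is copied keeps the extra space needed at roughly the size of the largest single file rather than the size of all files together.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Lucene's CompoundFileWriter.
class CompoundCopySketch {

    /**
     * Appends every part to the compound file and deletes each part as soon
     * as its bytes have been copied. Returns the largest combined size in
     * bytes of (compound so far + parts still on disk) seen during the run.
     */
    static long packAndDelete(Path compound, List<Path> parts) throws IOException {
        long remaining = 0;                       // bytes still held by source files
        for (Path part : parts) {
            remaining += Files.size(part);
        }
        long written = 0;                         // bytes already in the compound file
        long peak = remaining;
        try (OutputStream out = Files.newOutputStream(compound,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path part : parts) {
                long size = Files.size(part);
                Files.copy(part, out);            // append this part's bytes
                out.flush();
                written += size;
                // Just before the delete, this part exists twice on disk.
                peak = Math.max(peak, written + remaining);
                Files.delete(part);               // free the source immediately
                remaining -= size;
            }
        }
        return peak;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("cfs-sketch");
        Path a = Files.write(dir.resolve("a.bin"), new byte[10]);
        Path b = Files.write(dir.resolve("b.bin"), new byte[20]);
        Path c = Files.write(dir.resolve("c.bin"), new byte[30]);
        long peak = packAndDelete(dir.resolve("seg.cfs"), Arrays.asList(a, b, c));
        System.out.println("peak bytes on disk: " + peak);
    }
}
```

Deleting only after the whole compound file has been written would instead leave every source file on disk next to a complete copy of its data.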
The change also requires some modifications to the TestCompoundFile 
class. Several test methods compared the original file to its 
counterpart inside the compound file. With the modified SegmentMerger 
and CompoundFileWriter, the original file has already been deleted and 
can no longer be opened.

Here are some statistics about disk usage during index creation:

compound option is off:
final index size:    380 KB      max. diskspace used:    408 KB
final index size:  11079 KB      max. diskspace used:  11381 KB
final index size: 204148 KB      max. diskspace used:  20739 KB

using compound index:
final index size:    380 KB      max. diskspace used:   1145 KB
final index size:  11079 KB      max. diskspace used:  33544 KB
final index size: 204148 KB      max. diskspace used: 614977 KB

using compound index with patch:
final index size:    380 KB      max. diskspace used:    777 KB
final index size:  11079 KB      max. diskspace used:  22464 KB
final index size: 204148 KB      max. diskspace used: 410829 KB
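
As a quick check of the factors quoted above, the ratio of peak disk usage to final index size can be computed from the compound-index figures. The class and method names below are hypothetical; the numbers are the ones from the two compound tables.

```java
// Hypothetical helper that recomputes peak/final ratios from the figures above.
class DiskUsageRatios {

    /** Peak disk usage divided by final index size (both in KB). */
    static double ratio(long peakKb, long finalKb) {
        return (double) peakKb / finalKb;
    }

    public static void main(String[] args) {
        long[] finalKb        = {380, 11079, 204148};
        long[] compoundPeakKb = {1145, 33544, 614977};   // compound index, unpatched
        long[] patchedPeakKb  = {777, 22464, 410829};    // compound index with patch
        for (int i = 0; i < finalKb.length; i++) {
            System.out.println(finalKb[i] + " KB: unpatched "
                    + ratio(compoundPeakKb[i], finalKb[i])
                    + ", patched " + ratio(patchedPeakKb[i], finalKb[i]));
        }
    }
}
```

For all three index sizes the unpatched ratio comes out slightly above 3 and the patched ratio slightly above 2.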

The change was tested under Windows and Linux without any negative side 
effects. All JUnit test cases pass. In the attachment you'll find 
all the necessary files:

SegmentMerger.java
CompoundFileWriter.java
TestCompoundFile.java

SegmentMerger.diff
CompoundFileWriter.diff
TestCompoundFile.diff

keep moving
Bernhard


