lucene-dev mailing list archives

From Michael McCandless <>
Subject Re: detected corrupted index / performance improvement
Date Fri, 08 Feb 2008 10:14:36 GMT

Mike, you're right: all Lucene files are written sequentially
(whether flushing or merging).

It's just a matter of how many are open at once, and whether we are
also reading from the source file(s).  Both affect IO throughput far
less than truly random-access writes would.

Plus, as of LUCENE-843, bytes are written to tvx/tvd/tvf and fdx/fdt
"as we go", which is better because we get the bytes to the OS earlier
so it can properly schedule their arrival to stable storage.  So by
the time we flush a segment, the OS should have committed most of
those bytes.

When writing a segment, we write fnm, then open tii/tis/frq/prx at
once and write (sequentially) to them, then write to nrm.
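
The flush pattern above (several files open at once, each appended to
sequentially) is close to the round-robin test Mike Klaas describes in
the quoted message below.  A rough sketch of such a test, not the code
he actually ran: the class name, file names and the 64 MB total are
made up for illustration, and since nothing is sync'd it mostly
measures how fast the OS accepts the bytes.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class RoundRobinWrite {
    /**
     * Appends chunkSize bytes to each of fileCount files in turn until at
     * least totalBytes have been written in total. Returns elapsed nanoseconds.
     */
    static long write(Path dir, int fileCount, int chunkSize, long totalBytes)
            throws IOException {
        byte[] chunk = new byte[chunkSize];
        OutputStream[] outs = new OutputStream[fileCount];
        long start = System.nanoTime();
        try {
            // Open every file up front, as a segment flush opens tii/tis/frq/prx.
            for (int i = 0; i < fileCount; i++) {
                outs[i] = Files.newOutputStream(dir.resolve("part" + i + ".bin"));
            }
            // Each individual file is still written strictly sequentially.
            long written = 0;
            for (int i = 0; written < totalBytes; i = (i + 1) % fileCount) {
                outs[i].write(chunk);
                written += chunkSize;
            }
        } finally {
            for (OutputStream out : outs) {
                if (out != null) {
                    out.close();
                }
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rr-write");
        long total = 64L * 1024 * 1024; // assumption: far smaller than the 1 GB in the quoted test
        // 1 file / 1K chunks, 2 files / 512-byte chunks, 4 files / 256-byte chunks
        for (int files = 1; files <= 4; files *= 2) {
            int chunkSize = 1024 / files;
            long nanos = write(dir, files, chunkSize, total);
            System.out.printf("%d file(s), %d-byte chunks: %d ms%n",
                    files, chunkSize, nanos / 1_000_000);
        }
    }
}
```

Timings from something like this depend heavily on the filesystem and
the OS cache, so treat them as a relative comparison at best.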

Merging is far more IO intensive.  With mergeFactor=10, we read from
40 input streams (4 files for each of the 10 segments) and write to
4 output streams when merging the tii/tis/frq/prx files.
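
The stream counts above are just the number of postings files times
the number of segments being merged.  A small illustration of that
arithmetic (the class and method names are hypothetical, not Lucene
code):

```java
public class MergeStreamCount {
    // Files that are open simultaneously while merging postings,
    // as listed in the message above.
    static final String[] POSTINGS_FILES = {"tii", "tis", "frq", "prx"};

    // One input stream per postings file per segment being merged.
    static int inputStreams(int mergeFactor) {
        return mergeFactor * POSTINGS_FILES.length;
    }

    // One output stream per postings file of the merged segment.
    static int outputStreams() {
        return POSTINGS_FILES.length;
    }

    public static void main(String[] args) {
        int mergeFactor = 10;
        System.out.println("input streams:  " + inputStreams(mergeFactor));
        System.out.println("output streams: " + outputStreams());
    }
}
```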


Mike Klaas wrote:

> Oh, it certainly causes some random access--I don't deny that.  I
> just want to emphasize that this isn't at all the same as all
> "random writes", which would be expected to perform an order of
> magnitude slower.
> Just did a test where I wrote out a 1 GB file in 1K chunks.  Then
> wrote it out in 2 files, alternating 512-byte chunks, then 4 files
> with 256-byte chunks.  Some speed is lost--perhaps 10% at each
> doubling--but the speed is still essentially "sequential" speed.
> You can get back the original performance by using consistent sized
> chunks (1K to each file round-robin).
> HDD controllers are actually quite good at batching writes into
> sequential ones.  Why else do you think sync() takes so long :)
> -Mike
> On 7-Feb-08, at 3:35 PM, robert engels wrote:
>> I don't think that is true - but I'm probably wrong though :).
>> My understanding is that several files are written in parallel  
>> (during the merge), causing random access. After the files are  
>> written, then they are all reread and written as a CFS file
>> (essentially sequential - although the read and write is going to
>> cause head movement).
>> The code:
>> private IndexOutput tvx, tvf, tvd;   // To write term vectors
>> private FieldsWriter fieldsWriter;
>> is my clue that several files are written at once.
>> On Feb 7, 2008, at 5:19 PM, Mike Klaas wrote:
>>> On 7-Feb-08, at 2:00 PM, robert engels wrote:
>>>> My point is that commit needs to be used in most applications,
>>>> and the commit in Lucene is very slow.
>>>> You don't have 2x the IO cost, mainly because only the log file
>>>> needs to be sync'd.  The index only has to be sync'd eventually,
>>>> in order to prune the logfile - this can be done in the
>>>> background, improving the performance of the update and commit
>>>> cycle.
>>>> Also, writing the log file is very efficient because it is an
>>>> append/sequential operation. Writing the segment files writes
>>>> multiple files - essentially causing random access writes.
>>> For large segments, multiple sequentially-written large files  
>>> should perform similarly to one large sequentially-written file.   
>>> It is only close to random access on the smallest segments (which  
>>> a sufficiently-large flush-by-ram shouldn't produce).
>>> -Mike
