lucene-dev mailing list archives

From Mike Klaas <>
Subject Re: detected corrupted index / performance improvement
Date Fri, 08 Feb 2008 00:16:54 GMT
Oh, it certainly causes some random access--I don't deny that.  I
just want to emphasize that this isn't at all the same as fully
random writes, which would be expected to perform an order of
magnitude slower.

Just did a test where I wrote out a 1 GB file in 1 KB chunks, then
wrote it out to 2 files in alternating 512-byte chunks, then to 4
files in 256-byte chunks.  Some speed is lost--perhaps 10% at each
doubling--but the speed is still essentially "sequential" speed.  You
can get back the original performance by using consistently-sized
chunks (1 KB to each file, round-robin).
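A rough sketch of the kind of test I mean (class and method names are
mine, and the sizes are scaled down from 1 GB so it runs quickly; the
point is only the round-robin write pattern):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class InterleavedWriteBench {

    // Write totalBytes across nFiles, alternating one chunk per file
    // round-robin, and return the elapsed time in nanoseconds.
    static long writeRoundRobin(Path dir, int nFiles, int chunkSize,
                                long totalBytes) throws IOException {
        RandomAccessFile[] outs = new RandomAccessFile[nFiles];
        for (int i = 0; i < nFiles; i++) {
            outs[i] = new RandomAccessFile(dir.resolve("f" + i).toFile(), "rw");
        }
        byte[] chunk = new byte[chunkSize];   // zero-filled payload
        long start = System.nanoTime();
        long written = 0;
        int target = 0;
        while (written < totalBytes) {
            outs[target].write(chunk);        // alternate files per chunk
            written += chunkSize;
            target = (target + 1) % nFiles;
        }
        for (RandomAccessFile out : outs) {
            out.close();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        long total = 16L * 1024 * 1024;       // 16 MB here, 1 GB in the post
        // Same three configurations as above: 1 file / 1 KB chunks,
        // 2 files / 512-byte chunks, 4 files / 256-byte chunks.
        long t1 = writeRoundRobin(Files.createTempDirectory("b1"), 1, 1024, total);
        long t2 = writeRoundRobin(Files.createTempDirectory("b2"), 2, 512, total);
        long t4 = writeRoundRobin(Files.createTempDirectory("b4"), 4, 256, total);
        System.out.printf("1 file: %d ms, 2 files: %d ms, 4 files: %d ms%n",
                t1 / 1_000_000, t2 / 1_000_000, t4 / 1_000_000);
    }
}
```

At OS level the interleaved variants still present the drive with a
handful of append-only streams, which the elevator/write-back cache
can largely linearize--hence the modest slowdown rather than a
random-write collapse.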

HDD controllers are actually quite good at batching writes into
sequential ones.  Why else do you think sync() takes so long? :)


On 7-Feb-08, at 3:35 PM, robert engels wrote:

> I don't think that is true - but I'm probably wrong :).
> My understanding is that several files are written in parallel
> (during the merge), causing random access.  After the files are
> written, they are all reread and written out as a CFS file
> (essentially sequential - although the interleaved read and write
> will cause head movement).
> The code:
> private IndexOutput tvx, tvf, tvd;   // To write term vectors
> private FieldsWriter fieldsWriter;
> is my clue that several files are written at once.
> On Feb 7, 2008, at 5:19 PM, Mike Klaas wrote:
>> On 7-Feb-08, at 2:00 PM, robert engels wrote:
>>> My point is that commit needs to be used in most applications,  
>>> and the commit in Lucene is very slow.
>>> You don't have 2x the IO cost, mainly because only the log file  
>>> needs to be sync'd.  The index only has to be sync'd eventually,  
>>> in order to prune the logfile - this can be done in the  
>>> background, improving the performance of the update and commit cycle.
>>> Also, writing the log file is very efficient because it is an
>>> append/sequential operation.  Writing the segments involves
>>> multiple files - essentially causing random-access writes.
>> For large segments, multiple sequentially-written large files  
>> should perform similarly to one large sequentially-written file.   
>> It is only close to random access on the smallest segments (which  
>> a sufficiently-large flush-by-ram shouldn't produce).
>> -Mike

