lucene-dev mailing list archives

From Michael McCandless <>
Subject Re: detected corrupted index / performance improvement
Date Thu, 07 Feb 2008 10:20:40 GMT

But then you're back to syncing in a BG thread, right?  We've come
full circle.

Asynchronously syncing gives the best performance we've seen so far,
and so that's the current patch on LUCENE-1044 (using CMS's threads).
Using a transaction log would also require async. syncing, but then
would also add 2X IO cost of flushing and 2X disk usage between

I don't see how that could be faster.  I expect it to perform quite a
bit worse.
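
To make the background-sync idea concrete, here is a minimal sketch. It is not the LUCENE-1044 patch: that patch uses CMS's threads, whereas this sketch assumes a plain single-thread executor, and the class name BackgroundSyncer and its methods are hypothetical. The writer queues an fsync for each newly written file and blocks only at the point where a commit must be durable.

```java
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of Lucene: fsyncs files on a background thread.
final class BackgroundSyncer implements AutoCloseable {
    // Assumption: one dedicated thread; the actual patch reuses CMS's threads.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final List<Future<?>> pending = new ArrayList<>();

    // Queue an fsync for a file that has already been written and closed.
    synchronized void syncLater(Path file) {
        pending.add(executor.submit(() -> {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
                ch.force(true); // flush file data and metadata to the device
            }
            return null;
        }));
    }

    // Block until every queued fsync has finished. A failed fsync
    // surfaces here as an ExecutionException.
    synchronized void awaitAll() throws InterruptedException, ExecutionException {
        for (Future<?> f : pending) {
            f.get();
        }
        pending.clear();
    }

    // Number of fsyncs queued since the last awaitAll().
    synchronized int pendingCount() {
        return pending.size();
    }

    @Override
    public void close() {
        executor.shutdown();
    }
}
```

A writer would call syncLater() after flushing each segment file and awaitAll() just before publishing a new commit point, so indexing threads do not wait on the disk for every file.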

Also, I tested system wide sync in LUCENE-1044 and found it no better
than syncing individual files synchronously (which was our worst
performance number). And I don't think Lucene should be doing a system
wide sync.  There may be other processes doing IO whose buffers we
shouldn't, and don't need to, sync.


robert engels wrote:

> Yes, but this pruning could be more efficient. On a background  
> thread, get current segment from segments file, call the system  
> wide sync (e.g. System.exec("fsync")), then you can purge the  
> transaction logs for all segments up to that one. Since it is a  
> background operation, you are not blocking the writing of new  
> segments and tx logs.
> On Feb 6, 2008, at 4:42 PM, Michael McCandless wrote:
>> robert engels wrote:
>>> Do we have any way of determining if a segment is definitely OK/ 
>>> VALID ?
>> The only way I know is the CheckIndex tool, and it's rather slow (and
>> it's not clear that it always catches all corruption).
>>> If so, a much more efficient transactional system could be  
>>> developed.
>>> Serialize the updates to a log file. Sync the log. Update the  
>>> lucene index WITHOUT any sync.  Log file writing/sync is VERY  
>>> efficient since it is sequential, and a single file.
>>> Upon open of the index, detect if index was not shut down cleanly.  
>>> If so, determine the last valid segment, delete the bad segments,  
>>> and then perform the updates (from the log file) since the last  
>>> valid segment was written.
>>> The detection could be a VERY slow operation, but this is ok,  
>>> since it should be rare, and then you will only pay this price on  
>>> the rare occasion, not on every update.
>> Wouldn't you still need to sync periodically, so you can prune the
>> transaction log?  Else your transaction log is growing as fast as the
>> index?  (You've doubled disk usage).
>> Mike
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
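
For readers following the transaction-log proposal quoted above, a minimal sketch of the log side might look like the following. Everything in it is an assumption made for illustration: the class name UpdateLog, the record layout (a sequence number followed by a string), and the idea that recovery already knows the sequence number of the last update contained in the last valid segment. It shows only the sequential append with a sync per update, and the replay step run after an unclean shutdown.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: a sequential, fsynced update log.
final class UpdateLog implements AutoCloseable {
    private final FileOutputStream file;
    private final DataOutputStream out;

    UpdateLog(Path path) throws IOException {
        file = new FileOutputStream(path.toFile(), true); // open in append mode
        out = new DataOutputStream(file);
    }

    // Append one update tagged with a sequence number, then fsync the
    // log so the update is durable before the index itself is touched.
    void append(long seq, String update) throws IOException {
        out.writeLong(seq);
        out.writeUTF(update);
        out.flush();
        file.getFD().sync();
    }

    // Recovery: return the updates whose sequence number is greater than
    // lastDurableSeq (assumed to come from the last valid segment).
    static List<String> replayAfter(Path path, long lastDurableSeq) throws IOException {
        List<String> updates = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new FileInputStream(path.toFile()))) {
            while (true) {
                long seq = in.readLong();
                String update = in.readUTF();
                if (seq > lastDurableSeq) {
                    updates.add(update);
                }
            }
        } catch (EOFException endOfLog) {
            // End of the log, or a truncated final record: stop replaying.
        }
        return updates;
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

The sketch leaves out pruning, which is the point under debate in this thread: log records can only be dropped once the segments that contain them are known to be on disk, and that in turn requires syncing the index files at some point.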
