lucene-dev mailing list archives

From robert engels
Subject Re: detected corrupted index / performance improvement
Date Thu, 07 Feb 2008 22:00:05 GMT
My point is that commit needs to be used in most applications, and  
the commit in Lucene is very slow.

You don't have 2x the IO cost, mainly because only the log file needs  
to be sync'd.  The index only has to be sync'd eventually, in order  
to prune the logfile - this can be done in the background, improving  
the performance of update and commit cycle.

Also, writing the log file is very efficient because it is an  
append/sequential operation. Writing the segment files writes  
multiple files - essentially causing random access writes.
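
To make the log-first commit concrete, here is a minimal sketch in Python. It is not Lucene code: the `OperationLog` class, its method names, and the JSON-lines record format are all assumptions made for illustration. The point it shows is that commit() syncs a single append-only file, and the log is pruned only after the real index files have been synced (assumed to happen in some background task).

```python
import json
import os


class OperationLog:
    """Hypothetical append-only operation log; commit() syncs only this file."""

    def __init__(self, path):
        self.path = path
        # Append mode: every write lands at the end of the file (sequential IO).
        self._file = open(path, "ab")

    def append(self, op, doc_id, doc=None):
        # One JSON record per line (an assumed format, chosen for simplicity).
        record = {"op": op, "id": doc_id, "doc": doc}
        self._file.write(json.dumps(record).encode("utf-8") + b"\n")

    def commit(self):
        # Durability point: flush Python's buffer, then ask the OS to sync.
        self._file.flush()
        os.fsync(self._file.fileno())

    def replay(self):
        """Yield the records currently in the log, e.g. during crash recovery."""
        with open(self.path, "rb") as f:
            for line in f:
                yield json.loads(line)

    def prune(self):
        """Empty the log; call only after the index files themselves are synced."""
        self._file.flush()
        self._file.truncate(0)
        os.fsync(self._file.fileno())
```

A caller would append() each add/delete, call commit() when it needs a durability guarantee, and replay() the log against the last synced index after a crash. A real implementation would also have to handle a torn final record, which this sketch ignores.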

I guess I don't see the benefit of 1044 if you can't guarantee the  
index is at a certain point (you can by calling commit(), but it is  
VERY slow).

I was thinking a better design is to serialize the documents/operations
to disk, maintain an in-memory index of updates/removes, and then merge
that index into the main index when needed - using a parallel reader on
both in the meantime.
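
A rough sketch of that idea in Python, under loud assumptions: `OverlayIndex` is a hypothetical name, plain dicts of term -> doc ids stand in for both the on-disk index and the in-memory one, and "parallel reader" is approximated by a search that consults both. It is meant only to show how pending updates and removes can mask the main index until a merge.

```python
from collections import defaultdict


class OverlayIndex:
    """Hypothetical main index plus an in-memory overlay of updates/removes."""

    def __init__(self):
        self.main = defaultdict(set)     # term -> doc ids; stands in for the on-disk index
        self.pending = defaultdict(set)  # term -> doc ids added since the last merge
        self.deleted = set()             # doc ids whose main-index entries are masked

    def delete(self, doc_id):
        # Mask the document in the main index and drop any pending version.
        self.deleted.add(doc_id)
        for ids in self.pending.values():
            ids.discard(doc_id)

    def add(self, doc_id, text):
        # Treated as an upsert: any older version of the document is removed first.
        self.delete(doc_id)
        for term in text.lower().split():
            self.pending[term].add(doc_id)

    def search(self, term):
        # Read both indexes; deletions apply only to the main index.
        term = term.lower()
        main_hits = self.main.get(term, set()) - self.deleted
        return main_hits | self.pending.get(term, set())

    def merge(self):
        # Fold the overlay into the main index, then reset the overlay.
        for ids in self.main.values():
            ids -= self.deleted
        for term, ids in self.pending.items():
            self.main[term] |= ids
        self.pending.clear()
        self.deleted.clear()
```

Searches see updates immediately, while merge() can run whenever it is convenient, which is the part that would correspond to syncing the real index files.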

On Feb 7, 2008, at 3:06 PM, Michael McCandless wrote:

> robert engels wrote:
>> I might be misunderstanding 1044.  There were several approaches,  
>> and I am not certain what was the final???
> The final approach (take 7) is to make the index consistent (sync  
> the files) after finishing a merge.  Also, a new method ("commit")  
> is added which will force a synchronous sync while you wait.  Close  
> also does this.
>> I reread the bug and am still a bit unclear.
>> If the segments are sync'd as part of the commit, then yes, that  
>> would suffice. The merges don't need to commit, you just can't  
>> delete the segments until the merge completes.
>> I  think that building the segments, and syncing each segment -  
>> since in most cases the caller is going to call commit as part of  
>> each update, is going to be slower than writing the documents/ 
>> operations to a log file, but a lot depends on how Lucene is used  
>> (interactive vs. batch, lots of updates vs. a few).
> Well, and based on how frequently you prune the transaction log  
> (sync the real files).  I think the 2X IO cost is going to make  
> performance worse with the transaction log.
>> I am not sure how deletions are impacted by all of this.
> Should be fine?  The *.del files need to be sync'd just like the  
> rest of the segments files.
> Mike
