lucene-dev mailing list archives

From Dmitry Serebrennikov <>
Subject Re: optimized disk usage when creating a compound index
Date Mon, 09 Aug 2004 17:15:20 GMT

Christoph wrote:

>Hi Dmitry,
>Thanks for looking into the code.
>Dmitry Serebrennikov <> wrote on 08.08.2004:
>>I'm sorry for jumping into this late, but my impression was that the 
>>files being deleted were of the new segment, not the files for segments 
>>being merged. This, I think, would be ok, because if the operation 
>>fails, the old files are still there and the new segment is never 
>>entered into the "segments" file and thus the index remains uncorrupted. 
>>However, if we delete the previous segments first, we'd have no way of 
>>recovering from failure during the merge process.
>I delete the old segments only after they have been merged into a new
>segment, but before creating the compound file of the new segment.
>>    I'm sorry, Christoph, but I don't think these changes will work 
>>right... I just looked through the current CVS and it seems to me that 
>>there is a problem because segmentInfos.write() calls in the IndexWriter 
>>end up replacing "segments" file with a new one that puts the newly 
>>created segment on-line. Now, if writing of the compound file fails, we 
>>end up with a corrupt index.
>That's right. When I call merger.createCompoundFile() I can already be
>sure that the new segment is there and is correct, but I cannot be sure
>that creation of the compound file will work. That could be very
>dangerous.
>>    Another problem is that the writing of the compound file now happens 
>>under the commit.lock, whereas before it happened outside of it. This is 
>>potentially a very lengthy operation and will prevent any new 
>>IndexReaders from being created for a long time, possibly minutes!
>>    And taking the new call to createCompoundFile() out of the lock 
>>won't do either because that would mean that IndexReaders could be 
>>created during this time, but they would be confused since they will go 
>>after the new segment and try to open a half-constructed "cfs" file.
>I admit, this is also a serious problem. It has to be changed!
>What I wanted to achieve is to avoid holding up to three copies of a
>part (potentially the whole) of an index on disk at the same time.
>These three copies currently are:
>1) old segments that are merged
>2) New segment generated from merge, but not as compound file
>3) Compound file of the new segment
>Maybe we can split up the whole process into the following steps:
>1) merge segments into a new segment without generating a
>compound file
>2) write new segment into SegmentInfos and delete old segments
>(within a commit lock)
>3) Create compound file of new segment without deleting individual
>segment files (no commit lock needed); a proper temporary name has
>to be chosen for the compound file
>4) If compound file was generated correctly, delete the individual
>files that have been added to the compound file (*.tii, ....)
>and rename the generated compound file (within a commit lock)
Well, I think this could work, but I'm not sure how it will behave if 
an IndexReader is created on the new segment while it is still not in 
compound form. Then, when you try to delete the individual files, you'd 
have to implement something like a "deletable" file for segments (to 
work with Windows file locking).
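
To make steps 3 and 4 of the proposal concrete, here is a rough Java sketch. It is only an illustration under stated assumptions: CompoundCommitSketch, buildCompound and COMMIT_LOCK are hypothetical names, not Lucene APIs; the "compound file" is a plain concatenation rather than the real .cfs layout; and an in-process monitor stands in for commit.lock.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical sketch: none of these names are Lucene APIs.
class CompoundCommitSketch {
    // Stands in for commit.lock; the real lock is file-based and cross-process.
    private static final Object COMMIT_LOCK = new Object();

    /** Steps 3 and 4: build the compound file under a temporary name, then publish it. */
    static Path buildCompound(Path dir, String segment, List<String> extensions)
            throws IOException {
        Path tmp = dir.resolve(segment + ".cfs.tmp");
        Path cfs = dir.resolve(segment + ".cfs");

        // Step 3: the lengthy copy runs outside the lock, and the individual
        // segment files stay in place so the segment remains usable.
        try (OutputStream out = Files.newOutputStream(tmp)) {
            for (String ext : extensions) {
                Files.copy(dir.resolve(segment + "." + ext), out);
            }
        } catch (IOException e) {
            // A failure here leaves the uncompounded segment untouched.
            Files.deleteIfExists(tmp);
            throw e;
        }

        // Step 4: short critical section: rename, then drop the individual files.
        synchronized (COMMIT_LOCK) {
            Files.move(tmp, cfs, StandardCopyOption.REPLACE_EXISTING);
            for (String ext : extensions) {
                Files.deleteIfExists(dir.resolve(segment + "." + ext));
            }
        }
        return cfs;
    }
}
```

The sketch does not address the concern raised above: if a reader still has one of the individual files open on Windows, the delete in step 4 can fail, which is where something like a "deletable" list would come in.
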
Anyway, what do you think of the original way proposed by Bernard? I 
think that method was ok. If I understand correctly, in that method the 
merge process does not end until the compound file is created (as 
before), but the files are deleted as they are merged in. I suppose 
there is a chance that the compound file creation fails and we would 
not have any new segment, since the files that were usable would have 
been half deleted. Is that what's bothering you in this solution? To me 
this seems acceptable because it shouldn't happen frequently. What do 
you think? Is there anything I'm missing about Bernard's solution?
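
For comparison, a rough sketch of the delete-as-you-merge idea as described above. This is not Bernard's actual patch: DeleteAsMergedSketch and compoundDeletingAsWeGo are hypothetical names, and the compound file is again just a concatenation. It shows both the disk saving and the failure mode in question: once a file has been copied and deleted, a later failure leaves neither a complete compound file nor a complete set of individual files.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch: none of these names are Lucene APIs.
class DeleteAsMergedSketch {
    /** Copies each segment file into the compound file and deletes it right away. */
    static Path compoundDeletingAsWeGo(Path dir, String segment, List<String> extensions)
            throws IOException {
        Path cfs = dir.resolve(segment + ".cfs");
        try (OutputStream out = Files.newOutputStream(cfs)) {
            for (String ext : extensions) {
                Path part = dir.resolve(segment + "." + ext);
                Files.copy(part, out);
                out.flush();
                // Frees the space immediately, but the file cannot be
                // recovered if a later copy in this loop fails.
                Files.delete(part);
            }
        }
        return cfs;
    }
}
```
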

(By the way, thanks for helping to maintain and improve this code!)

>What do you think? Would that work?
>To unsubscribe, e-mail:
>For additional commands, e-mail:
