lucene-dev mailing list archives

From <gol...@detego-software.de>
Subject Re: Re: optimized disk usage when creating a compound index
Date Mon, 09 Aug 2004 12:36:01 GMT

Hi Dmitry,

Thanks for looking into the code.

Dmitry Serebrennikov <dmitrys@earthlink.net> wrote on 08.08.2004:
> I'm sorry for jumping into this late, but my impression was that the 
> files being deleted were of the new segment, not the files for segments 
> being merged. This, I think, would be ok, because if the operation 
> fails, the old files are still there and the new segment is never 
> entered into the "segments" file and thus the index remains uncorrupted. 
> However, if we delete the previous segments first, we'd have no way of 
> recovering from failure during the merge process.

I delete the old segments only after they have been merged into a new
segment, but before creating the compound file of the new segment.

>     I'm sorry, Christoph, but I don't think these changes will work 
> right... I just looked through the current CVS and it seems to me that 
> there is a problem because segmentInfos.write() calls in the IndexWriter 
> end up replacing "segments" file with a new one that puts the newly 
> created segment on-line. Now, if writing of the compound file fails, we 
> end up with a corrupt index.

That's right. When I call merger.createCompoundFile() I can already be
sure that the new segment exists and is correct, but I cannot be sure
that creating the compound file will succeed. If it fails at that point,
the "segments" file already refers to the new segment, so the index is
left corrupt.

>     Another problem is that the writing of the compound file now happens 
> under the commit.lock, whereas before it happened outside of it. This is 
> potentially a very lengthy operation and will prevent any new 
> IndexReaders from being created for a long time, possibly minutes!
>     And taking the new call to createCompoundFile() out of the lock 
> won't do either because that would mean that IndexReaders could be 
> created during this time, but they would be confused since they will go 
> after the new segment and try to open a half-constructed "cfs" file.

I admit this is also a serious problem. It has to be changed!

What I wanted to achieve is to avoid having up to three copies of part
(potentially all) of an index on disk at the same time. These three
copies currently are:
1) the old segments that are being merged
2) the new segment generated by the merge, not yet a compound file
3) the compound file of the new segment

Maybe we can split up the whole process into the following steps:
1) Merge the segments into a new segment without generating a
compound file
2) Write the new segment into SegmentInfos and delete the old segments
(within a commit lock)
3) Create the compound file of the new segment without deleting the
individual segment files (no commit lock needed); a proper temporary
name has to be chosen for the compound file
4) If the compound file was generated correctly, delete the individual
files that have been added to the compound file *.tii, ....
and rename the generated compound file (within a commit lock)
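
To make steps 3 and 4 more concrete, here is a rough sketch using plain
java.nio.file calls instead of the Lucene classes. Everything in it is
an assumption for illustration only: the class and method names, the
".cfs.tmp" temporary suffix, and the COMMIT_LOCK object that stands in
for the real commit.lock. It also just concatenates the files, whereas
a real compound file carries a directory of entries, so it only shows
the ordering of the operations.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class CompoundFileSketch {

    // Hypothetical stand-in for Lucene's commit.lock.
    static final Object COMMIT_LOCK = new Object();

    /**
     * Builds a (simplified) compound file for one segment.
     * Step 3 runs without the lock and writes under a temporary name.
     * Step 4 runs under the lock: rename first, then delete the parts.
     */
    static Path buildCompoundFile(Path dir, String segment, List<String> extensions)
            throws IOException {
        Path tmp = dir.resolve(segment + ".cfs.tmp"); // assumed temporary name
        Path cfs = dir.resolve(segment + ".cfs");

        // Step 3: potentially slow, so it happens outside the commit lock.
        // Readers keep using the individual files and ignore the temp file.
        try (OutputStream out = Files.newOutputStream(tmp)) {
            for (String ext : extensions) {
                Files.copy(dir.resolve(segment + "." + ext), out);
            }
        } catch (IOException e) {
            // On failure the individual files are untouched; drop the partial copy.
            Files.deleteIfExists(tmp);
            throw e;
        }

        // Step 4: short critical section. The compound file becomes visible
        // under its final name before the individual files are removed.
        synchronized (COMMIT_LOCK) {
            Files.move(tmp, cfs, StandardCopyOption.ATOMIC_MOVE);
            for (String ext : extensions) {
                Files.delete(dir.resolve(segment + "." + ext));
            }
        }
        return cfs;
    }
}
```

With this ordering, a failure while writing the compound file leaves
the individual segment files in place, and only the rename and the
deletes are done while holding the lock.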

What do you think? Would that work?

Christoph
