lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: optimized disk usage when creating a compound index
Date Sun, 08 Aug 2004 21:04:55 GMT
Christoph Goller wrote:

> Bernhard Messer wrote:
>
>> Hi Christoph,
>>
>> just reviewed TestCompoundFile.java, and you were absolutely right 
>> when saying that the test will fail on Windows. The test is now 
>> changed so that a second file with identical data is created. 
>> This file can be used in the test cases to make the comparisons 
>> against the compound store. The modified test now runs fine on both 
>> Windows and Linux.
>>
>> In the attachment you'll find the new TestCompoundFile source.
>>
>> hope this helps
>> Bernhard
>
>
> Hi Bernhard,
>
> I reconsidered your changes again.
> The problem that is solved is the following:
>
> If compound files are used, Lucene needs up to 3 times the disk space 
> (during indexing) that is required by the final index. The reason is 
> that during a merge of mergeFactor segments, the data in these segments 
> is duplicated on disk by merging them into a new segment, and then the 
> new segment's data is duplicated again while generating its compound 
> file.
>
> You solved the problem by deleting a segment's individual files 
> earlier, while the compound file is being built. However, this means 
> that the CompoundFileWriter now deletes files in its close operation. 
> This is not necessarily what one expects from a CompoundFileWriter: it 
> should only generate a compound file, not delete the original files. 
> That is why you had to change the CompoundFileWriter tests accordingly!
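
To make the disk-usage arithmetic above concrete, here is a rough
back-of-the-envelope sketch; this is not Lucene code, and the numbers
are made up for illustration:

    // Hypothetical sketch of peak disk usage while merging segments
    // into a compound file. Assume the segments being merged hold
    // `dataSize` MB of data in total.
    public class MergeDiskUsage {
        public static void main(String[] args) {
            long dataSize = 100;                // made-up figure, in MB
            long afterMerge = 2 * dataSize;     // old segments + new merged segment
            long peak = afterMerge + dataSize;  // compound file copies the new segment again
            System.out.println("peak ~ " + peak + " MB, i.e. 3x the data");
            // Deleting files earlier, before the compound file is complete,
            // would cap the peak at ~2x, which is the improvement at stake.
        }
    }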

I'm sorry for jumping into this late, but my impression was that the 
files being deleted were those of the new segment, not the files of the 
segments being merged. This, I think, would be ok, because if the 
operation fails, the old files are still there and the new segment is 
never entered into the "segments" file, so the index remains uncorrupted. 
However, if we delete the previous segments first, we'd have no way of 
recovering from a failure during the merge process.
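
In sketch form, the crash-safe ordering I have in mind looks roughly
like the hypothetical code below; the helper names are made up, not the
actual IndexWriter internals:

    import java.io.IOException;

    // Hypothetical sketch of a crash-safe ordering for committing a merge.
    abstract class SafeMergeSketch {
        abstract void writeNewSegmentFiles() throws IOException;
        abstract void createCompoundFile() throws IOException;
        abstract void commitSegmentsFile() throws IOException;
        abstract void deleteOldSegmentFiles() throws IOException;

        final void mergeSegments() throws IOException {
            writeNewSegmentFiles();   // old segments stay untouched
            createCompoundFile();     // a failure here loses only the new files
            commitSegmentsFile();     // atomic switch: new segment goes on-line
            deleteOldSegmentFiles();  // safe: nothing references them anymore
        }
    }

A failure anywhere before commitSegmentsFile() leaves the old index
fully usable; deleting the old segments any earlier gives up exactly
that guarantee.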

>
> My idea now is to change IndexWriter so that during a merge all old 
> segments are deleted before the compound file is generated. I think 
> this also avoids the factor of 3 and gets the maximum disk-space 
> overhead down to a factor of 2. I committed my changes. Could you run 
> a test, as you did with your patch, to verify whether my changes have 
> the desired outcome too? That would be great.

    I'm sorry, Christoph, but I don't think these changes will work 
right... I just looked through the current CVS, and it seems to me that 
there is a problem: the segmentInfos.write() calls in the IndexWriter 
end up replacing the "segments" file with a new one that puts the newly 
created segment on-line. Now, if writing of the compound file fails, we 
end up with a corrupt index.
    Another problem is that the writing of the compound file now happens 
under the commit.lock, whereas before it happened outside of it. This is 
a potentially very lengthy operation and will prevent any new 
IndexReaders from being created for a long time, possibly minutes!
    And taking the new call to createCompoundFile() out of the lock 
won't do either, because then IndexReaders could be created during this 
time, and they would be confused: they would go after the new segment 
and try to open a half-constructed ".cfs" file.
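
In sketch form, my reading of the ordering in the current CVS is roughly
the following; the method names are made up, and synchronized stands in
for the file-based commit.lock:

    import java.io.IOException;

    // Hypothetical sketch of the ordering being criticized: the segments
    // file is rewritten and the compound file is built while the commit
    // lock is held.
    abstract class CriticizedMergeSketch {
        abstract void writeSegmentsFile() throws IOException;   // publishes the new segment
        abstract void createCompoundFile() throws IOException;  // potentially lengthy I/O

        final void commitMerge(Object commitLock) throws IOException {
            synchronized (commitLock) {
                writeSegmentsFile();   // new segment now on-line
                createCompoundFile();  // a crash here leaves "segments" pointing
                                       // at a half-constructed .cfs file
            }                          // new readers are blocked the whole time
        }
    }

Either way we lose: corruption if createCompoundFile() fails after the
"segments" file is written, or readers stalled for the whole duration
if it succeeds slowly.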

    Again, I'm sorry, but I think I have to -1 these changes.

    -1.



Dmitry.

>
> Christoph
>
>



