lucene-dev mailing list archives

From Christoph Goller <>
Subject Re: optimized disk usage when creating a compound index
Date Fri, 13 Aug 2004 08:47:12 GMT
Dmitry Serebrennikov wrote:
> Hi Christoph,
> I agree that your approach achieves better disk usage than deleting 
> segments as they are being merged into the compound file, chiefly 
> because most indexes have one or two large files and the rest are small. 
> I have not reviewed your latest code yet (it's a bit hard without a 
> checked out working copy of the CVS image, btw, could you post diffs so 
> others can more readily review them?), but from what you are describing

I attach diffs for SegmentMerger, SegmentReader, and IndexWriter to this message.

> here's what I think. It sounds like it would work, but it also sounds a 
> bit kludgy. The main thing that I don't like is that we are now 
> inventing another way of doing what Lucene already does - maintaining 
> index integrity across filesystem changes and safely deleting unneeded 
> files. I'm thinking that Lucene already has a way of switching to the 
> new segments file, but we are proposing something similar with renaming 
> of the cfs file.

But I am using the deletion mechanism of IndexWriter.

> A note on the norms with .f and .s files - this is getting complicated...

The problem here is that if I use .f for changed norms and .f is still in the
deletable list that was generated by an IndexWriter, the norm changes could
get lost!
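
A minimal sketch of this hazard, assuming an in-memory set stands in for the index directory. The class, the method names, and the file names `_1.f0` and `_1.s0` are hypothetical illustrations, not Lucene code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DeletableNormsSketch {

    /** Deletes every queued file that currently exists; returns the names removed. */
    public static List<String> processDeletable(Set<String> directory, List<String> deletable) {
        List<String> removed = new ArrayList<>();
        for (String name : deletable) {
            if (directory.remove(name)) {
                removed.add(name);
            }
        }
        return removed;
    }

    /**
     * Simulates the sequence described above:
     * 1. a writer queues queuedName for later deletion,
     * 2. changed norms are then stored under rewrittenName,
     * 3. a later writer processes the deletable list.
     * Returns true if the changed norms are still present afterwards.
     */
    public static boolean changedNormsSurvive(String queuedName, String rewrittenName) {
        Set<String> directory = new HashSet<>();
        List<String> deletable = new ArrayList<>();
        deletable.add(queuedName);      // step 1: deletion is pending
        directory.add(rewrittenName);   // step 2: changed norms written
        processDeletable(directory, deletable); // step 3
        return directory.contains(rewrittenName);
    }

    public static void main(String[] args) {
        // Same name as the queued file: the changed norms are deleted.
        System.out.println(changedNormsSurvive("_1.f0", "_1.f0"));
        // Separate name for changed norms: they survive.
        System.out.println(changedNormsSurvive("_1.f0", "_1.s0"));
    }
}
```

In this toy model, writing changed norms under a name that can never appear in the deletable list avoids the collision.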

> One note on SegmentReader.files() - we should probably have the "tmp" 
> extension listed here so we can clean up segments that failed to create a 
> cfs file.

Yes, good idea.
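
A rough sketch of the idea, assuming a helper that lists a segment's files from a fixed extension table. The class name, the extension list, and the method signature are illustrative assumptions and do not reproduce the actual SegmentReader.files() code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SegmentFilesSketch {

    // Illustrative extension table; "tmp" is listed so that a leftover
    // temporary compound file is reported along with the other files.
    static final String[] EXTENSIONS = {
        "cfs", "fnm", "fdx", "fdt", "tii", "tis", "frq", "prx", "del", "tmp"
    };

    /** Returns the names in 'existing' that belong to 'segment', in table order. */
    public static List<String> files(String segment, Set<String> existing) {
        List<String> result = new ArrayList<>();
        for (String ext : EXTENSIONS) {
            String name = segment + "." + ext;
            if (existing.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

With "tmp" in the table, a cleanup pass that deletes everything returned for an obsolete segment would also remove a temporary file left behind by a failed compound-file creation.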

> Here's an alternative idea that leverages existing Lucene segments file:
> Could we simply create compound file in a new segment? This way we don't 
> have to invent the "tmp" file or change anything else about the files 
> (like the norms stuff).
> All in all, I haven't really been involved in Lucene codebase closely 
> enough lately, and this is starting to impact things like norms, locks, 
> and merging, so that I don't feel qualified to make the final call on 
> this. I'd like to hear what Doug and others think. From my point of 
> view, I don't really see anything *wrong* with the latest set of changes 
> (just need to add "tmp" file to SegmentReader.files()), but it doesn't 
> strike me as an obviously *right* way to do this either yet. So I'll 
> change my vote to a 0 and see what others think. :)

Yes, let's wait for Doug and others for a final decision.

