lucene-dev mailing list archives

From Michael Busch <>
Subject Re: Questions about doc store files (.cfx)
Date Mon, 09 Nov 2009 15:10:34 GMT
On 11/9/09 2:56 AM, Michael McCandless wrote:
> I think you're asking about the benefit of using "shared doc stores" at
> all?
> CFX is just the compound format of these shared files; if compound
> file is off, then they are still shared, just as separate (.fdx/t,
> .tvx/d/f) files.
Oh yeah, that's true. I do mean the shared doc stores in general.

> For building up a single large index, I suspect the win is
> sizable, if you store fields and compute term vectors.  You save a lot
> of IO not merging these files, within that one IndexWriter session.
> That said, the win is probably less than it used to be, now that we
> bulk-copy when merging these files.  Previously, without bulk copy, it
> also consumed a lot of CPU to merge the files.
> And it's true that the gains only apply within one IW session, so I'd
> expect this means in practice when building a huge index from scratch
> you see sizable gains, but then when rolling smallish updates into the
> index over time, there's no real gain. Though that's something we could
> [alternatively] pursue improving (eg if we allowed a single segment to
> reference multiple doc stores).
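A toy model of the IO argument above, as a minimal sketch and not Lucene code: the `Segment` class, its byte counts, and `merge_io_bytes` are hypothetical names made up for illustration. It assumes that a merge always rewrites postings, and that stored fields and term vectors only need copying when the merged segments do not all reference the same doc store. The real conditions in IndexWriter are more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a segment; not a Lucene class."""
    name: str
    postings_bytes: int   # postings, norms, etc.: always rewritten on merge
    doc_store_bytes: int  # stored fields + term vectors for this segment's docs
    doc_store_id: str     # segments flushed in one IW session may share this id


def merge_io_bytes(segments):
    """Estimate bytes written when merging `segments` into one segment."""
    total = sum(s.postings_bytes for s in segments)
    shares_one_store = len({s.doc_store_id for s in segments}) == 1
    if not shares_one_store:
        # Different doc stores: stored fields and term vectors must be copied.
        total += sum(s.doc_store_bytes for s in segments)
    return total


# Two segments flushed in the same session, sharing doc store "ds0".
same_session = [Segment("_0", 10, 100, "ds0"), Segment("_1", 20, 200, "ds0")]
# Two segments from different sessions, each with its own doc store.
two_sessions = [Segment("_0", 10, 100, "ds0"), Segment("_1", 20, 200, "ds1")]

print(merge_io_bytes(same_session), merge_io_bytes(two_sessions))
```

In this model the first merge writes only the postings, while the second also copies both doc stores, which is the case of rolling small updates into an existing index.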

Ok, thanks for clarifying.

> I do think keeping the IO cost down during merging is important;
> removing shared doc stores would be a step backwards (though,
> I agree, would simplify things).

Well, I was just wondering if you or anyone else had any numbers that 
quantify the benefits of the shared stores. If it really helps a lot I 
agree it's a good thing to have them. But they do add a layer of 
complexity to the code (and to the way one has to think about segments), 
so if the win is smallish this might not be desirable. Btw: I'm not 
trying to say it's required to remove them for parallel indexing. It'd 
just be simpler without them. You can think about a segmented
parallel index as a matrix of segments, and about the shared doc stores
as merging multiple cells in a single row or column of a spreadsheet.
It'd be a bit easier if that wasn't possible and it always was a true 
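The matrix picture can be sketched in a few lines. This is only an illustration under assumptions: rows stand for segments, columns for the parallel indexes, and each cell names the doc store it references. `doc_store_spans` and `is_true_matrix` are hypothetical helpers, not anything in Lucene.

```python
from collections import defaultdict


def doc_store_spans(cells):
    """Map each doc store id to the (row, column) cells that reference it.

    `cells` maps (segment_row, parallel_column) -> doc store id.
    """
    spans = defaultdict(list)
    for position, store_id in sorted(cells.items()):
        spans[store_id].append(position)
    return dict(spans)


def is_true_matrix(cells):
    """True when no doc store is shared, i.e. there are no 'merged cells'."""
    return all(len(positions) == 1 for positions in doc_store_spans(cells).values())


# Column 0: two segments flushed in one IndexWriter session share store "a".
# Column 1: each segment has its own store.
cells = {
    (0, 0): "a", (0, 1): "b",
    (1, 0): "a", (1, 1): "c",
}

print(doc_store_spans(cells))
print(is_true_matrix(cells))
```

Here store "a" spans two rows of column 0, which is the merged-cell case; without shared doc stores every store would map to exactly one cell.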


> Mike
> On Mon, Nov 9, 2009 at 3:17 AM, Michael Busch<>  wrote:
>> Hi,
>> I'm wondering about the benefits of having the .cfx files. The main
>> advantage is that you avoid merging (copying) stored fields and TermVectors
>> during segment merge, right? And I think .cfx files are only shared across
>> segments if the same IndexWriter is used to flush multiple segments and then
>> to commit all those segments in a single transaction. Then those segments
>> share the same .cfx file, correct? And in such a case .cfx files are also
>> not merged into .cfs files?
>> How big is the win of using .cfx files, usually? I'm wondering, because the
>> .cfx file is the only one that spans over multiple segments and therefore
>> adds more complexity to the code. For parallel indexing it'd be nice to not
>> have those kinds of files that belong to multiple segments, especially when
>> we want to update certain fields.
>>   Michael
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail: