lucene-dev mailing list archives

From Michael Busch <busch...@gmail.com>
Subject Re: Questions about doc store files (.cfx)
Date Mon, 09 Nov 2009 15:10:34 GMT
On 11/9/09 2:56 AM, Michael McCandless wrote:
> I think you're asking about the benefit of using "shared doc stores" at
> all?
>
> CFX is just the compound format of these shared files; if compound
> file is off, then they are still shared, just as separate (.fdx/t,
> .tvx/d/f) files.
>
>    
Oh yeah, that's true. I do mean the shared doc stores in general.

> For building up a single large index, I suspect the win is
> sizable, if you store fields and compute term vectors.  You save a lot
> of IO not merging these files, within that one IndexWriter session.
>
> That said, the win is probably less than it used to be, now that we
> bulk-copy when merging these files.  Previously, without bulk copy, it
> also consumed a lot of CPU to merge the files.
>
> And it's true that the gains only apply within one IW session, so I'd
> expect this means in practice when building a huge index from scratch
> you see sizable gains, but then when rolling smallish updates into the
> index over time, there's no real gain. Though that's something we could
> [alternatively] pursue improving (eg if we allowed a single segment to
> reference multiple doc stores).
>
>    

Ok, thanks for clarifying.
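
To make the trade-off above concrete, here is a toy cost model in Java. It
is only a sketch under a simplifying assumption: if all segments in a merge
reference the same doc store, the merged segment keeps referencing it and
no stored-fields/term-vector bytes are copied; otherwise every input's doc
store bytes are copied. The class and method names (SharedDocStoreSketch,
Segment, mergeBytes) are made up for illustration and are not Lucene APIs.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy cost model; these classes are illustrative and are not Lucene APIs. */
public class SharedDocStoreSketch {

    /** A flushed segment and the doc store it references (hypothetical model). */
    static final class Segment {
        final String docStoreId;   // which stored-fields/term-vector files it points at
        final long docStoreBytes;  // this segment's share of those files
        final long otherBytes;     // everything else (postings, norms, ...)

        Segment(String docStoreId, long docStoreBytes, long otherBytes) {
            this.docStoreId = docStoreId;
            this.docStoreBytes = docStoreBytes;
            this.otherBytes = otherBytes;
        }
    }

    /** Bytes a merge has to write under the simplified rule stated above. */
    static long mergeBytes(List<Segment> toMerge) {
        Set<String> stores = new HashSet<>();
        long other = 0;
        long docStore = 0;
        for (Segment s : toMerge) {
            stores.add(s.docStoreId);
            other += s.otherBytes;
            docStore += s.docStoreBytes;
        }
        // Assumption: one common doc store means the merged segment can keep
        // pointing at it, so only the non-doc-store files are rewritten.
        return stores.size() <= 1 ? other : other + docStore;
    }
}
```

In this model, segments flushed within one IndexWriter session share a
docStoreId and merge cheaply, while segments from different sessions do
not, which matches the "no real gain for smallish updates" case.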

> I do think keeping the IO cost down during merging is important;
> removing shared doc stores would be a step backwards (though,
> I agree, would simplify things).
>
>    

Well, I was just wondering if you or anyone else had any numbers that 
quantify the benefits of the shared stores. If it really helps a lot I 
agree it's a good thing to have them. But they do add a layer of 
complexity to the code (and to the way one has to think about segments), 
so if the win is smallish this might not be desirable. Btw: I'm not 
trying to say it's required to remove them for parallel indexing. It'd 
just be simpler without them. You can think of a segmented parallel 
index as a matrix of segments, and of the shared doc stores as merging 
multiple cells in a single row or column of a spreadsheet. It'd be a 
bit easier if that wasn't possible and it always was a true matrix.
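
A minimal Java sketch of that picture, assuming each cell of the matrix is
a segment labelled with the id of the doc store it uses. The names here
(SegmentMatrixSketch, isTrueMatrix) and the row/column layout are
hypothetical, chosen only to illustrate the analogy; they are not Lucene
APIs.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustration of the "matrix of segments" analogy; not Lucene code. */
public class SegmentMatrixSketch {

    /**
     * docStoreIds[row][col] is the id of the doc store used by the segment in
     * that cell. The layout is a "true matrix" when no doc store id is used by
     * more than one cell, i.e. there are no merged spreadsheet cells.
     */
    static boolean isTrueMatrix(String[][] docStoreIds) {
        Set<String> seen = new HashSet<>();
        for (String[] row : docStoreIds) {
            for (String id : row) {
                if (!seen.add(id)) {
                    return false; // this doc store spans several segments
                }
            }
        }
        return true;
    }
}
```

With shared doc stores, two neighbouring cells can carry the same id, and
isTrueMatrix returns false for that layout.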

  Michael


> Mike
>
> On Mon, Nov 9, 2009 at 3:17 AM, Michael Busch<buschmic@gmail.com>  wrote:
>    
>> Hi,
>>
>> I'm wondering about the benefits of having the .cfx files. The main
>> advantage is that you avoid merging (copying) stored fields and TermVectors
>> during segment merge, right? And I think .cfx files are only shared across
>> segments if the same IndexWriter is used to flush multiple segments and then
>> to commit all those segments in a single transaction. Then those segments
>> share the same .cfx file, correct? And in such a case .cfx files are also
>> not merged into .cfs files?
>>
>> How big is the win of using .cfx files, usually? I'm wondering, because the
>> .cfx file is the only one that spans multiple segments and therefore
>> adds more complexity to the code. For parallel indexing it'd be nice to not
>> have those kinds of files that belong to multiple segments, especially when
>> we want to update certain fields.
>>
>>   Michael
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>>      

