lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Questions about doc store files (.cfx)
Date Mon, 09 Nov 2009 10:56:40 GMT
I think you're asking about the benefit of using "shared doc stores" at
all?

CFX is just the compound format of these shared files; if the compound
file format is off, then they are still shared, just as separate files
(.fdx/.fdt for stored fields, .tvx/.tvd/.tvf for term vectors).

For building up a single large index, I suspect the win is
sizable, if you store fields and compute term vectors.  You save a lot
of IO by not merging these files within that one IndexWriter session.

That said, the win is probably less than it used to be, now that we
bulk-copy when merging these files.  Previously, without bulk copy,
merging the files also consumed a lot of CPU.
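
To make the IO argument concrete, here is a rough back-of-the-envelope
model. It is not Lucene code: the function name, the parameters and the
sizes are all made up for illustration, and it assumes equally sized
flushed segments, a flush count that is an exact power of the merge
factor, and that every merge level rewrites every byte it touches.

```python
def merge_mb_copied(num_flushed, postings_mb, store_mb,
                    merge_factor, shared_doc_stores):
    """Toy estimate of MB rewritten by merges in one writer session.

    Hypothetical model, not Lucene's merge policy: each level merges
    groups of `merge_factor` equal segments until one segment remains.
    """
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")

    # With shared doc stores, stored fields and term vectors are not
    # copied by merges inside the session; only the other files are.
    per_flushed_mb = postings_mb + (0 if shared_doc_stores else store_mb)

    copied_mb = 0
    segments = num_flushed
    while segments > 1:
        if segments % merge_factor != 0:
            raise ValueError("num_flushed must be a power of merge_factor")
        # One merge level rewrites the data of every flushed segment once.
        copied_mb += num_flushed * per_flushed_mb
        segments //= merge_factor
    return copied_mb


if __name__ == "__main__":
    # Made-up sizes: 100 flushes, 10 MB postings and 40 MB doc store each.
    for shared in (False, True):
        mb = merge_mb_copied(100, 10, 40, 10, shared_doc_stores=shared)
        print("shared_doc_stores=%s: %d MB copied" % (shared, mb))
```

In this toy setup the saving is exactly the doc store bytes times the
number of merge levels, so it grows with how much of the index is stored
fields and term vectors. Bulk copy makes those bytes cheaper in CPU
terms, but they would still have to be read and written.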

And it's true that the gains only apply within one IndexWriter session,
so I'd expect that in practice you see sizable gains when building a
huge index from scratch, but no real gain when rolling smallish updates
into the index over time. Though that's something we could
alternatively pursue improving (e.g. if we allowed a single segment to
reference multiple doc stores).

I do think keeping the IO cost down during merging is important;
removing shared doc stores would be a step backwards (though,
I agree, it would simplify things).

Mike

On Mon, Nov 9, 2009 at 3:17 AM, Michael Busch <buschmic@gmail.com> wrote:
> Hi,
>
> I'm wondering about the benefits of having the .cfx files. The main
> advantage is that you avoid merging (copying) stored fields and TermVectors
> during segment merge, right? And I think .cfx files are only shared across
> segments if the same IndexWriter is used to flush multiple segments and then
> to commit all those segments in a single transaction. Then those segments
> share the same .cfx file, correct? And in such a case .cfx files are also
> not merged into .cfs files?
>
> How big is usually the win of using .cfx files? I'm wondering, because the
> .cfx file is the only one that spans over multiple segments and therefore
> adds more complexity to the code. For parallel indexing it'd be nice to not
> have those kind of files that belong to multiple segments, especially when
> we want to update certain fields.
>
>  Michael
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
