lucene-dev mailing list archives

From Michael Busch <>
Subject Re: Questions about doc store files (.cfx)
Date Tue, 10 Nov 2009 18:18:21 GMT
On 11/10/09 1:57 AM, Michael McCandless wrote:
>> I think this is exactly what happens? I wrote a small test program that
>> creates a situation like mentioned above in the "expungeDelete" scenario. It
>> ends up with a docstore containing docs from two segments, but after
>> expungeDeletes only one segment references the docstore. The non-deleted
>> docs from the other segment end up in a new segment, so they are twice on
>> disk (once orphaned in the old docstore, once in the new segment).
>> Is that the desired behavior?
> Right this is what happens -- since segment C wasn't merged, it
> remains as the only segment still referencing the shared doc stores,
> and, yes, this does result in duplicate storage for some docs (until C
> is merged away).  IFD keeps track of whether a given set of doc stores
> is still referenced.

OK, thanks for clarifying!
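
To make the corner case concrete, here is a toy model of the reference tracking described above. It is only a sketch: the class and method names are made up for illustration and are not Lucene's actual IndexFileDeleter code. It just shows why a shared doc store stays on disk as long as one segment still points at it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical names, not Lucene's IndexFileDeleter.
class DocStoreRefs {
    // doc store name -> names of live segments that still reference it
    private final Map<String, Set<String>> refs = new HashMap<>();

    void addSegment(String segment, String docStore) {
        refs.computeIfAbsent(docStore, k -> new HashSet<>()).add(segment);
    }

    // Dropping a segment (e.g. because it was merged away) releases its reference.
    void removeSegment(String segment) {
        for (Set<String> segments : refs.values()) {
            segments.remove(segment);
        }
    }

    // A doc store's files can be deleted only once no live segment references them.
    boolean isDeletable(String docStore) {
        Set<String> segments = refs.get(docStore);
        return segments == null || segments.isEmpty();
    }
}
```

If segments B and C share doc store "_0" and only B is rewritten into a new segment with its own store, "_0" is still referenced by C and is not deletable, so B's surviving docs sit on disk twice until C is merged away as well.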

> I think in practice this should not result in too much duplication.
> If C is large, it's likely to have accumulated deletes as well.  If C
> is small, it's likely to get merged away in the course of normal
> merging.

I agree, it shouldn't happen very often. I just wasn't sure what the
current behavior in this corner case was and wanted to understand it.

> But, if we are really concerned with it, we could modify the merge
> policy to bias its selection on this ("remove stores that are wasting
> too much space") basis.

I'm not too concerned, because I also don't think this should happen 
very often.
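
For reference, the kind of bias suggested above could be as simple as the following sketch. Everything here is an assumption for illustration: the helper, its parameters, and the threshold are hypothetical and not part of Lucene's MergePolicy API, and it presumes the byte counts are already known.

```java
// Hypothetical helper, not part of Lucene's MergePolicy API.
class DocStoreWaste {
    // Fraction of a shared doc store's bytes that no live segment references anymore.
    static double wastedFraction(long totalBytes, long referencedBytes) {
        if (totalBytes <= 0) {
            return 0.0;
        }
        return (double) (totalBytes - referencedBytes) / totalBytes;
    }

    // A merge policy could prefer segments whose doc store wastes more than the threshold.
    static boolean shouldPreferForMerge(long totalBytes, long referencedBytes, double threshold) {
        return wastedFraction(totalBytes, referencedBytes) > threshold;
    }
}
```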

> I think this makes the parallel index job simpler, right?  I.e., how
> the segments are sharing the stores within their own index does not
> restrict what merging is done.

Yes exactly. It won't prevent us from keeping the parallel indexes 
independent in this regard.

Then the compound (.cfx and .cfs) files are rather orthogonal to this. I
talked to Marvin at ApacheCon; in Lucy he wants to have all the compound
file support in the store package, separate from the indexer. I think
that would make sense in Lucene too; there's no real need to have it
tightly integrated into IndexWriter and SegmentMerger. We can
generalize the compound file concept further, so that with parallel
indexes the files can be selected in either direction for inclusion in a
compound file.

E.g. if we separated the inverted index and store, so that they are 
logically two parallel index components, then the .cfx file as it works 
now would contain files from two parallel index components (term vectors 
from inverted index, stored fields from the store). This is fine if you 
don't want to update those components individually and can remain this 
way for the default IndexWriter implementation. But if we generalize the 
compound concept, then people can alter this behavior to better suit 
their update requirements.
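
A rough sketch of what such a generalized grouping could look like, purely as an illustration: none of these class or method names exist in Lucene. The only facts assumed from the current file formats are the extensions for stored fields (.fdt, .fdx) and term vectors (.tvx, .tvd, .tvf). The rule that assigns a file to a compound file is pluggable, and the default rule mirrors today's .cfx/.cfs split.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names exist in Lucene.
class CompoundGrouping {
    // Default-style rule: stored fields and term vectors go into the doc store
    // compound file (.cfx); every other file goes into .cfs.
    static String defaultRule(String file) {
        String ext = file.substring(file.lastIndexOf('.') + 1);
        switch (ext) {
            case "fdt": case "fdx":             // stored fields
            case "tvx": case "tvd": case "tvf": // term vectors
                return "cfx";
            default:
                return "cfs";
        }
    }

    // Groups files by the compound file that the given rule assigns them to.
    static Map<String, List<String>> group(List<String> files, Function<String, String> rule) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String f : files) {
            out.computeIfAbsent(rule.apply(f), k -> new ArrayList<>()).add(f);
        }
        return out;
    }
}
```

Someone who wants to update the store independently of the inverted index could pass a different rule, for example one that keeps the term vector files with the inverted index files.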

I think this would actually be a very clean design (even though it might 
sound complicated here).

> Mike