lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Questions about doc store files (.cfx)
Date Tue, 10 Nov 2009 09:57:16 GMT
On Tue, Nov 10, 2009 at 12:06 AM, Michael Busch <buschmic@gmail.com> wrote:
> On 11/9/09 5:40 PM, Michael Busch wrote:
>>
>> I think that should be ok with parallel indexing, as long as we can always
>> select all corresponding segments from *all* parallel indexes for a merge to
>> keep the docIds in sync.
>>
>> That actually leads me to another question: Let's say you have three
>> segments a, b, c.  b and c share the same doc store. You perform deletes on
>> a and b. Then you call expungeDeletes(). Normally that call should only
>> merge a and b, because c doesn't have any deletes. But b and c have to
>> participate in the same merge, because they share the same doc store, right?
>> So would it merge all three segments?
>>
>> If that's the case (that b and c must be part of the same merge) then it
>> would make the parallel indexing more difficult. The reason is that if two
>> parallel indexes 1 and 2 can decide on their own how to share e.g. doc
>> stores across segments, then we might come into a situation where 1a and 1b
>> share the same doc store, and 2b and 2c share the same doc store. Then if
>> index 1 needs to merge 1a and 1b, it can't assume that this merge is
>> allowed. There would have to be someone on top of the whole thing who
>> decides that all three segments need to be merged at the same time, because
>> b is connected to a and c in the two parallel indexes. I wouldn't like such
>> a restriction very much.
>>
>> We could think about allowing merges like ab->d, even if b,c share the
>> same doc store. That would mean to copy the b part of the shared bc doc
>> store into the new segment d. Then until c gets deleted the stored docs of b
>> would be on disk twice and require more disk space temporarily.
>>
>
> I think this is exactly what happens? I wrote a small test program that
> creates a situation like mentioned above in the "expungeDelete" scenario. It
> ends up with a docstore containing docs from two segments, but after
> expungeDeletes only one segment references the docstore. The non-deleted
> docs from the other segment end up in a new segment, so they are twice on
> disk (once orphaned in the old docstore, once in the new segment).
> Is that the desired behavior?

Right, this is what happens: since segment C wasn't merged, it
remains the only segment still referencing the shared doc stores,
and, yes, this does result in duplicate storage for some docs (until C
is merged away).  IFD (IndexFileDeleter) keeps track of whether a
given set of doc stores is still referenced.
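
To make the reference tracking concrete, here is a minimal toy sketch
of the idea. It is not IndexFileDeleter's actual code; the class
DocStoreRefCounts, its methods, and the store names are hypothetical
and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (hypothetical, not Lucene's implementation): count how many
// live segments reference each doc store, and report when a store is
// no longer referenced so its files could be deleted.
class DocStoreRefCounts {
    // doc store name -> number of live segments referencing it
    private final Map<String, Integer> refs = new HashMap<>();

    /** A live segment starts referencing the given doc store. */
    void incRef(String docStore) {
        refs.merge(docStore, 1, Integer::sum);
    }

    /**
     * A segment stops referencing the given doc store.
     * Returns true if no segment references the store any more.
     */
    boolean decRef(String docStore) {
        Integer count = refs.get(docStore);
        if (count == null) {
            throw new IllegalStateException("unknown doc store: " + docStore);
        }
        if (count == 1) {
            refs.remove(docStore);
            return true;
        }
        refs.put(docStore, count - 1);
        return false;
    }

    int refCount(String docStore) {
        return refs.getOrDefault(docStore, 0);
    }

    public static void main(String[] args) {
        DocStoreRefCounts tracker = new DocStoreRefCounts();
        tracker.incRef("storeA");   // segment a has its own store
        tracker.incRef("storeBC");  // segment b shares a store ...
        tracker.incRef("storeBC");  // ... with segment c

        // expungeDeletes merges a and b into d, which gets a new store.
        tracker.incRef("storeD");
        boolean aFree = tracker.decRef("storeA");
        boolean bcFree = tracker.decRef("storeBC");

        System.out.println("storeA deletable: " + aFree);
        System.out.println("storeBC deletable: " + bcFree);
        System.out.println("storeBC refs: " + tracker.refCount("storeBC"));
    }
}
```

In this model storeBC stays on disk after the merge because c still
references it, which is where the temporary duplication of b's stored
docs comes from.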

I think in practice this should not result in too much duplication.
If C is large, it's likely to have accumulated deletes as well.  If C
is small, it's likely to get merged away in the course of normal
merging.

But if we are really concerned about it, we could modify the merge
policy to bias its selection toward merges that remove stores
wasting too much space.
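
As a rough sketch of what such a bias could measure (an assumption on
my part, not an existing merge policy feature), one could compute the
fraction of a shared doc store that no live segment covers any more
and favor merging the remaining segments once it passes a threshold.
The class WastedStoreBias, its methods, and the threshold value are
all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: estimates how much of a shared doc store is
// orphaned, i.e. holds docs that no live segment references any more.
class WastedStoreBias {

    /**
     * docsInStore: total number of docs written to the doc store.
     * liveSegmentDocCounts: doc counts of the segments that still
     * reference the store.
     * Returns the fraction of the store's docs that are orphaned.
     */
    static double wastedFraction(int docsInStore, List<Integer> liveSegmentDocCounts) {
        if (docsInStore <= 0) {
            return 0.0;
        }
        int referenced = 0;
        for (int n : liveSegmentDocCounts) {
            referenced += n;
        }
        return (docsInStore - referenced) / (double) docsInStore;
    }

    /** True if the orphaned fraction exceeds the given threshold. */
    static boolean shouldFavorMerge(int docsInStore, List<Integer> liveSegmentDocCounts,
                                    double threshold) {
        return wastedFraction(docsInStore, liveSegmentDocCounts) > threshold;
    }

    public static void main(String[] args) {
        // The store held 100 docs from b and c; b (75 docs) was merged
        // away, so only c (25 docs) still references it.
        List<Integer> live = Arrays.asList(25);
        System.out.println(wastedFraction(100, live));
        System.out.println(shouldFavorMerge(100, live, 0.5));
    }
}
```

A real policy would more likely weigh bytes on disk than doc counts;
doc counts just keep the sketch short.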

I think this makes the parallel index's job simpler, right?  I.e., how
the segments share the stores within their own index does not
restrict what merging is done.

Mike
