lucene-dev mailing list archives

From Michael Busch <busch...@gmail.com>
Subject Re: Questions about doc store files (.cfx)
Date Tue, 10 Nov 2009 01:40:11 GMT
On 11/9/09 9:00 AM, Michael McCandless wrote:
> Alas, I don't have any benchmarks offhand... if you want to run one,
> you should be able to hardwire flushDocStores=true in
> IndexWriter.doFlushInternal?  I think that'd turn off the sharing
> without breaking things (run the tests to be sure ;) ).

Yes, I'm pretty sure that works. I think I've even done that in the 
LUCENE-1879 patch (which works with Lucene 2.4).
>> Btw: I'm not trying to say it's
>> required to remove them for parallel indexing. It'd just be simpler
>> without them. You can think about a segmented parallel index as a matrix of
>> segments. And about the shared doc stores as merging multiple cells in a
>> single row or column of a spreadsheet. It'd be a bit easier if that wasn't
>> possible and it always was a true matrix.
> I agree, not sharing the stores would make things simpler.  Wouldn't
> the parallel indexes be able to "privately" share their own stores?
> I.e., how the sharing happens need not be in sync across the main &
> parallel indexes?

I think that should be ok with parallel indexing, as long as we can 
always select all corresponding segments from *all* parallel indexes for 
a merge to keep the docIds in sync.

That actually leads me to another question: Let's say you have three 
segments a, b, c.  b and c share the same doc store. You perform deletes 
on a and b. Then you call expungeDeletes(). Normally that call should 
only merge a and b, because c doesn't have any deletes. But b and c have 
to participate in the same merge, because they share the same doc store, 
right? So would it merge all three segments?
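To make the question concrete, here is a minimal sketch of the merge selection under the assumption being asked about, namely that segments sharing a doc store must take part in the same merge. `Seg`, `ExpungeSketch`, and `mergeSet` are hypothetical names for illustration only; none of them are Lucene classes, and the code does not claim to match what IndexWriter actually does.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a segment; not a Lucene class.
final class Seg {
    final String name;
    final String docStore;    // id of the doc store holding this segment's stored fields
    final boolean hasDeletes;

    Seg(String name, String docStore, boolean hasDeletes) {
        this.name = name;
        this.docStore = docStore;
        this.hasDeletes = hasDeletes;
    }
}

final class ExpungeSketch {
    // Returns the names of the segments that would join the merge IF segments
    // sharing a doc store had to be merged together (the assumption in question).
    static List<String> mergeSet(List<Seg> segments) {
        // Doc stores touched by at least one segment that has deletes.
        Set<String> touchedStores = new HashSet<>();
        for (Seg s : segments) {
            if (s.hasDeletes) {
                touchedStores.add(s.docStore);
            }
        }
        // Every segment using a touched store is pulled in, deletes or not.
        List<String> result = new ArrayList<>();
        for (Seg s : segments) {
            if (touchedStores.contains(s.docStore)) {
                result.add(s.name);
            }
        }
        return result;
    }
}
```

With a in its own store, b and c in a shared store, and deletes on a and b, this model selects all three segments, which is exactly the outcome the question is about.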

If that's the case (that b and c must be part of the same merge), then it
would make parallel indexing more difficult. The reason is that if
two parallel indexes 1 and 2 can decide on their own how to share e.g.
doc stores across segments, then we might end up in a situation where 1a
and 1b share the same doc store, and 2b and 2c share the same doc store.
Then if index 1 needs to merge 1a and 1b, it can't assume that this
merge is allowed. Some coordinator on top of the whole thing would have
to decide that all three segments need to be merged at the same
time, because b is connected to a and c across the two parallel indexes. I
wouldn't like such a restriction very much.
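The coordination problem described here amounts to a transitive closure over the sharing relations of all parallel indexes. The following is a rough sketch of that closure, assuming segment positions are identified by name (a, b, c) and each parallel index reports its sharing groups as sets of positions. `ParallelMergeClosure` and `closure` are made-up names, not part of Lucene.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ParallelMergeClosure {
    // requested:     segment positions one index would like to merge
    // sharingGroups: for ALL parallel indexes, the sets of segment positions
    //                whose doc stores are shared within that index
    // Returns every position that must join the merge so that no shared
    // doc store is split and all parallel indexes merge the same positions.
    static Set<String> closure(Set<String> requested, List<Set<String>> sharingGroups) {
        Set<String> result = new TreeSet<>(requested);
        Deque<String> work = new ArrayDeque<>(requested);
        while (!work.isEmpty()) {
            String seg = work.pop();
            for (Set<String> group : sharingGroups) {
                if (group.contains(seg)) {
                    for (String other : group) {
                        // Newly reached positions are expanded in turn.
                        if (result.add(other)) {
                            work.push(other);
                        }
                    }
                }
            }
        }
        return result;
    }
}
```

For the example in the text, a request to merge {a, b} with sharing groups {a, b} (from index 1) and {b, c} (from index 2) grows to {a, b, c}, since b links the two groups.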

We could think about allowing merges like ab->d, even if b and c share the
same doc store. That would mean copying the b part of the shared bc doc
store into the new segment d. Then, until c gets deleted, the stored docs
of b would be on disk twice and temporarily require more disk space.
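The temporary disk cost of such a merge can be estimated with simple arithmetic. The sketch below assumes that a has a private doc store, that b and c share one, and that deleted documents are ignored (so d's store is as large as a's and b's parts together). `DocStoreDiskSketch` and its methods are illustrative names only.

```java
final class DocStoreDiskSketch {
    // Doc store bytes before the merge: a's private store plus the shared bc store.
    static long before(long a, long b, long c) {
        return a + (b + c);
    }

    // Doc store bytes after ab->d has finished and a's old store is gone,
    // while c still references the shared bc store: d's new store holds
    // copies of a and b, and the bc store still holds b and c.
    static long afterMerge(long a, long b, long c) {
        return (a + b) + (b + c);
    }

    // Extra bytes held until the shared bc store can be dropped.
    static long temporaryOverhead(long a, long b, long c) {
        return afterMerge(a, b, c) - before(a, b, c);
    }
}
```

Under these assumptions the overhead is exactly the size of b's part of the shared store, and it goes away once c no longer needs that store.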

Well, maybe there is already a solution for all this in the code and I'm
just not aware of it?

  Michael




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

