lucene-dev mailing list archives

From Shai Erera <>
Subject Re: Per-Thread DW and IW
Date Wed, 21 Apr 2010 17:12:35 GMT
I don't advocate developing PI as an external entity to Lucene; you've
already done that! :)

We should open up IW enough to develop PI efficiently, but I think we should
always allow the applications using it some freedom and flexibility. If IW
simply created a Parallel DW and handled the merges on its own, as if those
were just one big happy bunch of Directories, then apps wouldn't be able to
plug in their own custom IWs, such as maybe a FacetedIW (one which handles
the facets in the application).

If that 'openness' of IW is the SegmentsWriter API, then that might be
enough. I imagine apps will want to control things like add/update/delete of
documents, but it should be IW which controls the MergePolicy (MP) and
MergeScheduler (MS) for all slices (you could provide your own, but it would
be one MP and MS for all slices, not one per slice). Also, methods like
addIndexes* probably cannot be supported by PI, unless we add a special
method signature which accepts ParallelWriter[] or some such.
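
To make that concrete, here's a rough sketch of the kind of API I have in
mind. ParallelIndexWriter and addSlice() are made-up names (nothing like
them exists today); only the MP/MS classes are the real ones:

    // Hypothetical sketch only. The point: the app controls add/delete,
    // but exactly one MP and one MS are configured on the coordinator,
    // never per slice.
    ParallelIndexWriter pw = new ParallelIndexWriter(
        new LogByteSizeMergePolicy(),      // one MP across all slices
        new ConcurrentMergeScheduler());   // one MS across all slices
    pw.addSlice("content", contentDir);    // each slice backed by its own Directory
    pw.addSlice("facets", facetsDir);

    Document doc = new Document();
    doc.add(new Field("body", "some text", Field.Store.NO, Field.Index.ANALYZED));
    pw.addDocument(doc);                       // PI routes fields, keeps docIDs aligned
    pw.deleteDocuments(new Term("id", "42"));  // a delete must hit every slice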

Currently, I view SegmentWriter as DocumentWriter, and so I think I'm
operating under such low-level assumptions. But since I work over IW, some
things are buried too low. Maybe we should refactor IW first, before PI is
developed ... any estimate on when Per-Thread DW is going to be ready? :)


On Wed, Apr 21, 2010 at 6:48 PM, Michael Busch <> wrote:

> Yeah, sounds like we have the same things in mind here.  In fact, this is
> pretty similar to what we discussed a while ago on LUCENE-2026, I think.
> SegmentWriter could be a higher-level interface with more than one
> implementation.  E.g. there could be one SegmentWriter that supports
> appending documents (i.e. the DocumentsWriter today) and also one that
> allows adding terms one at a time, e.g. similar to what IW.addIndexes*()
> does today.  Often when you rewrite entire parallel slices you don't want
> to use addDocument().  E.g. when you read from a source slice, modify some
> data and write a new version of that slice, it can be dramatically faster
> to write postinglist after postinglist, because you avoid parallel I/O and
> a lot of seeks. (By dramatically faster I mean e.g. 24 hrs vs. 8 mins,
> actual numbers from an implementation I had at IBM...)
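> Roughly, as a sketch (neither class exists today; names and signatures are
> only for illustration):
>
>     // Flavor 1: appends whole documents -- what DocumentsWriter does today.
>     abstract class SegmentWriter {
>       abstract void addDocument(Document doc) throws IOException;
>       abstract SegmentInfo flush() throws IOException;
>     }
>
>     // Flavor 2: consumes a slice term-at-a-time, so rewriting a slice can
>     // stream postinglist after postinglist, mostly sequential I/O.
>     abstract class TermOrientedSegmentWriter {
>       abstract void startTerm(String field, String text) throws IOException;
>       abstract void addPosting(int docID, int freq, int[] positions) throws IOException;
>       abstract SegmentInfo flush() throws IOException;
>     }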
> Further, I imagine utilizing the slice concept within Lucene.  The store
> could be a separate slice, and so could be the norms and the new flexible
> scoring data structures.  It would then be super easy to turn those off or
> rewrite them individually (see LUCENE-2025).  Often parallel indexes don't
> need a store or norms, so this slice concept makes total sense in my
> opinion.
> Norms actually work like this already: you can rewrite them, which bumps
> up their generation number.  We just have to make this concept more
> abstract, so that it can be used for any kind of slice.
> Many people have also asked about allowing Lucene to manage external data
> structures.  I think these changes would allow exactly that: just implement
> your external data structure as a slice, and Lucene will call your code
> when merges, deletes, and adds happen. Cool! :)
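> E.g. something like this made-up callback interface (all names are
> hypothetical):
>
>     // Sketch: an external data structure registers as a slice and gets
>     // notified of index events, staying in sync with the main index.
>     interface IndexSlice {
>       void onDocumentAdded(int docID, Document doc) throws IOException;
>       void onDeletesApplied(int[] deletedDocIDs) throws IOException;
>       void onMerge(SegmentInfos merged, SegmentInfo result) throws IOException;
>       long getGeneration();  // bumped when the slice is rewritten, like norms
>     }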
> @Shai: If we implement parallel indexing outside of Lucene's core, then we
> have some of the same drawbacks as with the current master-slave approach.
> I'm especially worried about how that would work with realtime indexing
> (both the searchable RAM buffer and NRT).  I think PI must be completely
> segment-aware.  Then it should fit very nicely into realtime indexing,
> which is also very cool!
> Michael
> On 4/21/10 8:06 AM, Michael McCandless wrote:
>> I do think the idea of an abstract class (or interface) SegmentWriter
>> is compelling.
>> Each DWPT would be a [single-threaded] SegmentWriter.
>> And then we'd make a MultiThreadedSegmentWriterWrapper (manages a
>> collection of SegmentWriters, pushing deletes down to them, aggregating
>> RAM used across all, picking which ones to flush, etc.).
>> Then, a SlicedSegmentWriter (say) would write to separate slices,
>> single-threaded, and then you could make it multi-threaded by wrapping
>> it w/ the above class.
>> Though SegmentWriter isn't a great name, since it would in general
>> write to multiple segments.  Indexer is a little too broad though :)
>> Something like that, maybe?
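>> Roughly (a hypothetical sketch; threadState() stands in for however we
>> end up binding a thread to its DWPT):
>>
>>     class MultiThreadedSegmentWriterWrapper {
>>       private final List<SegmentWriter> writers;  // e.g. one DWPT per thread state
>>
>>       void addDocument(Document doc) throws IOException {
>>         threadState().addDocument(doc);  // single-threaded inside, no sync needed
>>       }
>>
>>       // deletes have to reach every underlying writer
>>       void bufferDelete(Term term) {
>>         for (SegmentWriter w : writers) w.bufferDelete(term);
>>       }
>>
>>       // default flush policy: pick the writer using the most RAM
>>       SegmentWriter pickFlush() {
>>         SegmentWriter max = writers.get(0);
>>         for (SegmentWriter w : writers)
>>           if (w.ramUsed() > max.ramUsed()) max = w;
>>         return max;
>>       }
>>     }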
>> Also, allowing an app to directly control the underlying
>> SegmentWriters inside IndexWriter (instead of letting the
>> multi-threaded wrapper decide for you) is compelling for way-advanced
>> apps, I think.  E.g. your app may know it's done indexing from source A
>> for a while, so it should go and flush that writer right now (whereas the
>> default "flush the one using the most RAM" could leave that source
>> unflushed for quite a while, tying up RAM, unless we do some kind of
>> LRU flushing policy or something).
>> Mike
>> On Wed, Apr 21, 2010 at 2:27 AM, Shai Erera <> wrote:
>>> I'm not sure that a Parallel DW would work for PI, because DW is too
>>> internal to IW. Currently, the approach I've been thinking about for PI
>>> is to tackle it from a high level, e.g. allow the application to pass a
>>> Directory, or even an IW instance, and PI will play the coordinator
>>> role, ensuring that segment merges happen across all the slices in
>>> lockstep, implementing two-phase operations etc. A Parallel DW then does
>>> not fit nicely w/ that approach (unless we want to refactor how IW works
>>> completely), because DW is not aware of the Directory, and if PI indeed
>>> works over IW instances, then each will have its own DW.
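>>> As a sketch of the coordinator idea (ParallelIndex is made up, but
>>> prepareCommit()/commit() are today's IW API):
>>>
>>>     // PI holds one IW per slice and runs a two-phase commit across them,
>>>     // so either all slices move to the new commit point or none does.
>>>     class ParallelIndex {
>>>       private final IndexWriter[] sliceWriters;  // app-provided, one per slice
>>>       ParallelIndex(IndexWriter... sliceWriters) { this.sliceWriters = sliceWriters; }
>>>
>>>       public void commit() throws IOException {
>>>         for (IndexWriter w : sliceWriters) w.prepareCommit();  // phase 1
>>>         for (IndexWriter w : sliceWriters) w.commit();         // phase 2
>>>         // real code would rollback() the prepared writers if phase 1 fails
>>>       }
>>>     }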
>>> So there are two basic approaches we can take for PI (following the
>>> current architecture): either let PI manage IW, or make PI a sort of IW
>>> itself, one which handles events at a much lower level. While the latter
>>> is more robust (and, based on current limitations I'm running into,
>>> might even be easier to do), it lacks the flexibility of letting the app
>>> plug in any IW it wants. That requirement is also important if the
>>> application wants to use PI in scenarios where it keeps some slices in
>>> RAM and some on disk, or wants to control more closely which fields go
>>> to which slice, so that it can at some point in time "rebuild" a certain
>>> slice outside PI and replace the existing slice in PI w/ the new one ...
>>> We should probably continue the discussion on PI, so I suggest we either
>>> move it to another thread or take it to the issue directly.
>>> Mike - I agree w/ you that we should keep application developers' lives
>>> easy, and that having IW itself support concurrency is beneficial.
>>> Like I said ... it was just a thought, aimed at keeping our (Lucene
>>> developers') lives easier, but that probably comes second to app-devs'
>>> lives :). I'm also not at all sure that it would have made our lives
>>> easier ...
>>> So I'm good if you want to drop the discussion.
>>> Shai
>>> On Tue, Apr 20, 2010 at 8:16 PM, Michael Busch <> wrote:
>>>> On 4/19/10 10:25 PM, Shai Erera wrote:
>>>>> It will definitely simplify multi-threaded handling for IW extensions
>>>>> like Parallel Index …
>>>> I'm keeping parallel indexing in mind.  After we have separate DWPTs,
>>>> I'd like to introduce parallel DWPTs that write different slices.
>>>> Synchronization should not be a big worry then, because writing is
>>>> single-threaded.
>>>> We could introduce a new abstract class SegmentWriter, which DWPT would
>>>> implement.  An extension would be ParallelSegmentWriter, which would
>>>> manage multiple SegmentWriters.  Or maybe SegmentSliceWriter would be a
>>>> better name.
>>>> Michael