lucene-dev mailing list archives

From "Michael Busch (JIRA)" <>
Subject [jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
Date Tue, 20 Apr 2010 15:52:52 GMT


Michael Busch commented on LUCENE-2324:

I still think this "zero sync'd code" at the cost of perf loss /
exposing per-DWPT details is taking things too far. You're cutting
into the bone... 

No worries - I haven't started implementing the RAM management part yet. :)

I don't think we should allow apps to set per-DWPT RAM limits, or
even expose to apps how IW manages threads (this is an impl. detail).

I think we should keep the approach we have today - you set the
overall RAM limit and IW internally manages flushing when that
allotted RAM is full.
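
A minimal sketch of what "one overall RAM limit, managed internally" could look like. This is not Lucene code: the class name GlobalRamTracker, its methods, and the choice to flush the largest writer first are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: one overall RAM limit shared by several private writers. */
final class GlobalRamTracker {
    private final long limitBytes;
    private final Map<String, Long> bytesPerWriter = new HashMap<>();
    private long totalBytes = 0;

    GlobalRamTracker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /**
     * Records that a writer buffered delta more bytes.
     * Returns the name of the writer that should flush once the aggregate
     * limit is reached, or null while there is still room.
     */
    synchronized String addBytes(String writer, long delta) {
        bytesPerWriter.merge(writer, delta, Long::sum);
        totalBytes += delta;
        if (totalBytes < limitBytes) {
            return null;
        }
        // Assumption: flush whichever writer currently holds the most RAM.
        String largest = null;
        long max = -1;
        for (Map.Entry<String, Long> e : bytesPerWriter.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                largest = e.getKey();
            }
        }
        return largest;
    }

    /** Called after a writer has flushed; its RAM no longer counts. */
    synchronized void flushed(String writer) {
        Long freed = bytesPerWriter.remove(writer);
        if (freed != null) {
            totalBytes -= freed;
        }
    }

    synchronized long totalBytes() {
        return totalBytes;
    }
}
```

The application only ever supplies limitBytes; how many writers exist and which one flushes stays an internal decision.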

I think the reason why we have two different APIs in mind (you: sourceID, I:
expert thread binder API) is that we have different goals for them? You
want to make the out-of-the-box indexing performance as good as possible, and
users should only have to set a minimal number of easy-to-understand parameters
(such as buffer size in MB). I think that's the right thing to do of course.
(though that doesn't prevent us from adding an expert API in addition, as we
always have)

I'm thinking a lot about real-time indexing and the searchable RAM buffer
these days, so the thread-binder API could help you to have more control over
where your docs will actually end up and which reader will see them. But I
think too that this API would be very "expert" and not many people would use it.

bq. We can do this as a separate issue... it's fairly orthogonal.

Yeah I was just thinking the same - I agree.

Is this sync really so bad? First, we should move all
allocators/pools to per-DWPT, so they don't need to be sync'd.

OK cool that we agree on that. I was worried you wanted to have global pools
too. If it's only the single long, it's not very complicated, I agree.

We're still gonna need sync'd code, anyway (global sequence ID,
grabbing a DWPT), right? We can put this "go flush a DWPT" logic in
the same block if we really have to? It feels like we're going to
great lengths (cutting into the bone) to avoid a trivial cost (the
minor complexity of managing flushing based on aggregate RAM used).

Sorry if I'm being annoying :) Yeah sure, there will be several sync'd spots.
If we don't share any data structures between threads that hold indexed (and
in the future searchable) data I'm happy.
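
A rough sketch of that split, under stated assumptions: the only synchronized steps hand out a writer and a global sequence ID, while the buffered data itself lives in a writer that one thread at a time owns. PrivateWriter, WriterPool and Ticket are hypothetical names, not classes from the patch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for a private writer; its buffer is never shared. */
final class PrivateWriter {
    final int id;
    // Writer-local buffer: only the thread holding this writer touches it.
    final List<String> buffered = new ArrayList<>();
    long lastSequenceId = -1;

    PrivateWriter(int id) {
        this.id = id;
    }

    void add(String doc, long sequenceId) {
        buffered.add(doc);
        lastSequenceId = sequenceId;
    }
}

/** A checked-out writer together with the sequence ID assigned to the operation. */
final class Ticket {
    final PrivateWriter writer;
    final long sequenceId;

    Ticket(PrivateWriter writer, long sequenceId) {
        this.writer = writer;
        this.sequenceId = sequenceId;
    }
}

final class WriterPool {
    private final Deque<PrivateWriter> free = new ArrayDeque<>();
    private long nextSequenceId = 0;
    private int created = 0;

    /** Synchronized: grab a free writer (or create one) and a sequence ID together. */
    synchronized Ticket checkout() {
        PrivateWriter w = free.isEmpty() ? new PrivateWriter(created++) : free.pop();
        return new Ticket(w, nextSequenceId++);
    }

    /** Synchronized: hand the writer back so another thread may use it. */
    synchronized void release(PrivateWriter w) {
        free.push(w);
    }

    /** The buffering itself runs outside any lock. */
    long addDocument(String doc) {
        Ticket t = checkout();
        try {
            t.writer.add(doc, t.sequenceId);
        } finally {
            release(t.writer);
        }
        return t.sequenceId;
    }
}
```

Because a writer is only handed over through the synchronized checkout/release pair, its buffer needs no locking of its own.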

I haven't spent as much time as you thinking about the current RAM management
yet and the current code that ensures thread safety - still learning some
parts of the code. I do appreciate all your patient feedback!

This isn't great... I mean it's weird that on adding say a 3rd
indexing thread I suddenly see a flush triggered even though I'm
nowhere near the RAM limit. Then, later, if I cut back to using only
2 threads, I still only ever use up to 2/3 of my RAM buffer. IW's
API really shouldn't have such "surprising" behavior where how many /
which threads come through it so drastically affect its flushing.

Yeah I don't really like that either. Let's not do that. At first I hadn't
thought about that disadvantage; I added this point to the list later and never
really liked it. (and knew you would complain about it :) )
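
The arithmetic behind that objection can be sketched as follows, assuming the rejected scheme splits the overall limit evenly across every writer created so far. The 48 MB figure and all names are illustrative only.

```java
/** Hypothetical illustration of a fixed, evenly split per-writer RAM share. */
final class FixedShareExample {

    /** RAM each writer may use when the overall limit is split across maxWriters. */
    static long perWriterShare(long limitBytes, int maxWriters) {
        return limitBytes / maxWriters;
    }

    /** Upper bound on RAM in use when only activeWriters of them receive documents. */
    static long usableBytes(long limitBytes, int maxWriters, int activeWriters) {
        return perWriterShare(limitBytes, maxWriters) * activeWriters;
    }

    public static void main(String[] args) {
        long limit = 48L * 1024 * 1024; // 48 MB, an arbitrary example value

        // With three writers, each one flushes at 16 MB, even if the other
        // two are nearly empty and the total is far below 48 MB.
        System.out.println(perWriterShare(limit, 3));

        // Dropping back to two active threads caps usage at 32 MB, i.e. 2/3.
        System.out.println(usableBytes(limit, 3, 2));
    }
}
```

A trigger based on aggregate RAM avoids both effects, since the number of threads no longer changes when a flush happens.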

My goal is to have a default indexing chain that isn't slower than the one we
have today, but is searchable, and searchable very fast. That's not trivial,
but I think we can do it!

I'll implement the global flush trigger and make all pools DWPT-local. The
explicit thread-binder or sourceID APIs we can worry about later, as we agreed.

> Per thread DocumentsWriters that write their own private segments
> -----------------------------------------------------------------
>                 Key: LUCENE-2324
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>         Attachments: lucene-2324.patch, LUCENE-2324.patch
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
