lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
Date Thu, 04 Mar 2010 12:59:27 GMT


Michael McCandless commented on LUCENE-2293:

I agree IW should not spawn its own threads.  It should piggy back on
incoming threads.

On whether we can remove the "perThread" layer throughout the chain --
that would be compelling.  But, we should scrutinize what that layer
does throughout the current chain to assess what we might lose.

But, I was proposing a bigger change (call it "private RAM segments"):
there would be multiple DWs, each one writing to its own private RAM
segment (each one getting private docID assignment) *and* its own doc stores.

There would be no more WaitQueue in IW.

Each DW would flush its own segment privately.  They would not all
flush at once (merging their postings) like we must do today because
they "share" a single docID space.

As I understand it, this would be a step towards how Lucy handles
concurrency during indexing.  Ie, it'd make the DWs nearly fully
independent from one another, and then IW is just there to dispatch/do
merging/etc.  (In Lucy each writer is a separate process, I think --
VERY independent).

We could do both changes, too (remove the "perThread" layer of
indexing chain and switch to private RAM segments) -- I think they
are actually orthogonal.
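The "private RAM segments" idea could be sketched roughly like this (a hypothetical class, not Lucene's actual DocumentsWriter, whose real indexing chain is far more involved):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one per-thread "private RAM segment" writer:
// each writer has its own docID counter and its own buffer, so it can
// flush independently, with no global merge sort of postings and no
// shared WaitQueue.
class PrivateSegmentWriter {
    private int nextDocID = 0;                 // private docID space
    private final List<String> buffered = new ArrayList<>();

    // Returns the docID assigned within this writer's private space.
    int addDocument(String doc) {
        buffered.add(doc);
        return nextDocID++;                    // monotonic per writer
    }

    // Flush this writer's buffer as its own segment; other writers
    // keep indexing. Returns the number of docs in the new segment.
    int flushAsSegment() {
        int count = buffered.size();
        buffered.clear();
        nextDocID = 0;
        return count;
    }
}
```

Merging the resulting small segments would then be left to the merge policy/scheduler, which already runs in the background.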

bq. The other downside is that you would have to buffer deleted docs and queries separately
for each thread state, because you have to keep the private docID? So that would need a bit
more memory.


bq. Mike, good one! Would having a doc id stream per thread make implementing a searchable
RAM buffer easier?

Yes -- they would just appear like sub segments.

bq. I hope we won't lose monotonic docIDs for single-threaded indexing somewhere along
that path.

We won't.

Instead, I prefer to take advantage of the application's concurrency level in the following way:

* Each thread will continue to write documents to a ThreadState. We'll allow changing the
MAX_LEVEL, so if an app wants to get more concurrency, it can.
  - MAX_LEVEL will set the number of ThreadState objects available.
* All threads will obtain memory buffers from a pool, which will be limited by IW's RAM limit.
* When a thread finishes indexing a document and realizes the pool has been exhausted, it
flushes its ThreadState.
  - At that moment, that ThreadState is pulled out of the 'active' list and is flushed. When
it's done, it reclaims its used buffers and is put back in the active list.
  - New threads that come in will simply pick a ThreadState from the pool (but we'll bind
them to that instance until it's flushed) and add documents to it.
  - That way, we hijack an application thread to do the flushing, which is anyway what happens
today.

+1 -- this I think matches what I was thinking.
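A minimal sketch of that pooled-ThreadState scheme (all names invented for illustration; the real DocumentsWriter bookkeeping is much richer). A Semaphore stands in for IW's RAM budget, and the indexing thread itself performs the flush when the pool runs dry:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;

// Hypothetical sketch, not Lucene code: a fixed set of ThreadStates,
// a shared buffer pool bounded by the global RAM budget, and the
// application thread "hijacked" to flush when the pool is exhausted.
class ThreadStatePool {
    static class ThreadState {
        int buffersHeld = 0;  // buffers this state has taken from the pool
    }

    private final Semaphore ramBudget;  // shared pool of RAM buffers
    private final BlockingQueue<ThreadState> active = new LinkedBlockingQueue<>();

    ThreadStatePool(int maxStates, int totalBuffers) {
        ramBudget = new Semaphore(totalBuffers);
        for (int i = 0; i < maxStates; i++) {
            active.add(new ThreadState());
        }
    }

    // Called by an application thread. Returns true if this call had to
    // flush (i.e. the caller was hijacked to do the flush itself).
    boolean addDocument() throws InterruptedException {
        ThreadState ts = active.take();       // bind to a free ThreadState
        try {
            boolean flushed = false;
            if (!ramBudget.tryAcquire()) {    // pool exhausted?
                flush(ts);                    // flush on *this* thread
                ramBudget.acquire();          // a buffer is now available
                flushed = true;
            }
            ts.buffersHeld++;                 // buffer the new document
            return flushed;
        } finally {
            active.put(ts);                   // back into the active list
        }
    }

    // Write the ThreadState's buffer as a segment (elided here), then
    // reclaim its buffers into the shared pool.
    private void flush(ThreadState ts) {
        ramBudget.release(ts.buffersHeld);
        ts.buffersHeld = 0;
    }
}
```

Because the flush happens on the calling thread while the other ThreadStates stay in the active queue, other application threads keep indexing undisturbed.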

bq. If only WaitQueue was documented

Sorry :(

But WaitQueue would go away with this change.  We would no longer have
shared doc stores!

> IndexWriter has hard limit on max concurrency
> ---------------------------------------------
>                 Key: LUCENE-2293
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.1
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which probably I should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and, wasteful because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic", meaning if N threads are indexing, today you roughly get
> in-time-order assignment of docIDs.  But with this change, all of one
> thread state would get 0..N docIDs, the next thread state'd get
> N+1...M docIDs, etc.  However, a single thread would still get
> monotonic assignment of docIDs.
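To make the docID layout described in the issue text concrete, here is a toy helper (not Lucene code) that rebases each flushed segment's private docIDs into the global space:

```java
// Toy illustration, not Lucene code: with private per-thread docID
// spaces, each flushed segment's local IDs get rebased by the doc
// counts of the segments flushed before it. IDs stay monotonic within
// a segment (i.e. within a single thread's output).
class DocIDRebase {
    // Given per-segment doc counts, return the {first, last} global
    // docID covered by each segment, in flush order.
    static int[][] globalRanges(int[] docsPerSegment) {
        int[][] ranges = new int[docsPerSegment.length][2];
        int base = 0;
        for (int i = 0; i < docsPerSegment.length; i++) {
            ranges[i][0] = base;                          // first global docID
            ranges[i][1] = base + docsPerSegment[i] - 1;  // last global docID
            base += docsPerSegment[i];
        }
        return ranges;
    }
}
```

For example, three thread states that flushed 4, 3, and 5 docs would map to global ranges 0..3, 4..6, and 7..11.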

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

