lucene-dev mailing list archives

From "Michael Busch (JIRA)" <>
Subject [jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
Date Thu, 04 Mar 2010 17:49:27 GMT


Michael Busch commented on LUCENE-2293:

bq. Yes, I think each DW will have to record its own buffered delete Term/Query, mapping to
its docID at the time the delete arrived. 

I think in the future, deletes in DW could work like this:
- DW of course keeps track of a private sequence ID, which gets incremented in the add, delete,
and update calls
- DW has a getReader() call; the returned reader can search the RAM buffer
- when DW.getReader() gets called, the new reader remembers the current seqID at the
time it was opened - let's call it RAMReader.seqID; if such a reader gets reopened, its
seqID simply gets updated.
- we keep a growing int array with the size of DW's maxDoc, which replaces the usual deleted-docs
BitSet
- when DW.updateDocument() or .deleteDocument() needs to delete a doc, we do that right away,
before inverting the new doc. We can do that by running a query using a RAMReader to find
all docs that must be deleted. Instead of flipping a bit in a BitSet, for each hit we now
keep track of when it was deleted:

{code}
// init each slot in the deletes array with NOT_DELETED
static final int NOT_DELETED = Integer.MAX_VALUE;
Arrays.fill(deletes, NOT_DELETED);
{code}


{code}
public void deleteDocument(Query q) {
  // reopen the RAMReader so it sees all docs buffered so far
  // run query q using the RAMReader
  for each hit {
    int hitDocId = ...;
    // record the seqID at which this doc was deleted
    if (deletes[hitDocId] == NOT_DELETED) {
      deletes[hitDocId] = DW.seqID;
    }
  }
}
{code}
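To pull the pieces together, here is a rough sketch of the DW-side state (all names are made up for illustration - this is not actual DocumentsWriter code):

{code}
// Rough sketch of the hypothetical DW-side bookkeeping described above.
class DocumentsWriter {
  static final int NOT_DELETED = Integer.MAX_VALUE;

  int seqID;     // bumped by every add/delete/update call
  int[] deletes; // one slot per buffered doc, grows with maxDoc;
                 // every slot starts out as NOT_DELETED

  // every reader shares the same deletes array (no cloning) and
  // snapshots the current seqID at open time
  RAMReader getReader() {
    return new RAMReader(deletes, seqID);
  }
}
{code}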

Now, no matter how often you (re)open RAMReaders, they can all share the same deletes array.
No cloning is necessary, unlike with the BitSet approach.

When the RAMReader iterates posting lists, treating deleted docs correctly is as simple as this.
Instead of doing:

{code}
  if (deletedDocsBitSet.get(doc)) {
    skip this doc
  }
{code}

we can now do:

{code}
  if (deletes[doc] < ramReader.seqID) {
    skip this doc
  }
{code}
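And the RAMReader side of the same hypothetical sketch:

{code}
// Hypothetical RAMReader sketch: every reader shares DW's deletes
// array; the only per-reader state is its seqID snapshot.
class RAMReader {
  private final int[] deletes; // shared with DW, never cloned
  private int seqID;           // DW.seqID at the time of (re)open

  RAMReader(int[] deletes, int seqID) {
    this.deletes = deletes;
    this.seqID = seqID;
  }

  // reopen is cheap: simply update the seqID snapshot
  void reopen(int currentSeqID) {
    this.seqID = currentSeqID;
  }

  // the check from above: a doc is deleted for this reader only if
  // it was deleted before the reader was (re)opened
  boolean isDeleted(int doc) {
    return deletes[doc] < seqID;
  }
}
{code}

Reopening is then O(1) and allocation-free - no per-reader BitSet clone is needed.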

Here is an example:
1. Add 3 docs with DW.addDocument() 
2. User opens ramReader_a
3. Delete doc 1
4. User opens ramReader_b

After 1: DW.seqID = 2; deletes[]={MAX_VALUE, MAX_VALUE, MAX_VALUE}
After 2: ramReader_a.seqID = 2
After 3: DW.seqID = 3; deletes[]={MAX_VALUE, 2, MAX_VALUE}
After 4: ramReader_b.seqID = 3

Note that both ramReader_a and ramReader_b share the same deletes[] array. Now when ramReader_a
is used to read posting lists, it will not treat doc 1 as deleted, because (deletes[1] <
ramReader_a.seqID) = (2 < 2) = false; but ramReader_b will see it as deleted, because (deletes[1]
< ramReader_b.seqID) = (2 < 3) = true.
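The walkthrough above is easy to verify with a few lines of throwaway code (purely illustrative, nothing here is real DW code):

{code}
public class SeqIdDeletesDemo {
  static final int NOT_DELETED = Integer.MAX_VALUE;

  public static void main(String[] args) {
    int[] deletes = new int[3];
    java.util.Arrays.fill(deletes, NOT_DELETED);

    int seqID = -1;       // DW.seqID before any operation

    seqID += 3;           // 1. add 3 docs -> DW.seqID = 2
    int readerA = seqID;  // 2. open ramReader_a -> snapshot seqID = 2

    deletes[1] = seqID++; // 3. delete doc 1 -> deletes[1] = 2, DW.seqID = 3
    int readerB = seqID;  // 4. open ramReader_b -> snapshot seqID = 3

    System.out.println(deletes[1] < readerA); // false: ramReader_a still sees doc 1
    System.out.println(deletes[1] < readerB); // true: ramReader_b treats doc 1 as deleted
  }
}
{code}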

What do you think about this approach for the future when we have a searchable DW buffer?

> IndexWriter has hard limit on max concurrency
> ---------------------------------------------
>                 Key: LUCENE-2293
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.1
> DocumentsWriter has this nasty hardwired constant:
> {code}
> private final static int MAX_THREAD_STATE = 5;
> {code}
> which I probably should have attached a //nocommit to the moment I
> wrote it ;)
> That constant sets the max number of thread states to 5.  This means,
> if more than 5 threads enter IndexWriter at once, they will "share"
> only 5 thread states, meaning we gate CPU concurrency to 5 running
> threads inside IW (each thread must first wait for the last thread to
> finish using the thread state before grabbing it).
> This is bad because modern hardware can make use of more than 5
> threads.  So I think an immediate fix is to make this settable
> (expert), and increase the default (8?).
> It's tricky, though, because the more thread states, the less RAM
> efficiency you have, meaning the worse indexing throughput.  So you
> shouldn't up and set this to 50: you'll be flushing too often.
> But... I think a better fix is to re-think how threads write state
> into DocumentsWriter.  Today, a single docID stream is assigned across
> threads (eg one thread gets docID=0, next one docID=1, etc.), and each
> thread writes to a private RAM buffer (living in the thread state),
> and then on flush we do a merge sort.  The merge sort is inefficient
> (does not currently use a PQ)... and wasteful, because we must
> re-decode every posting byte.
> I think we could change this, so that threads write to private RAM
> buffers, with a private docID stream, but then instead of merging on
> flush, we directly flush each thread as its own segment (and, allocate
> private docIDs to each thread).  We can then leave merging to CMS
> which can already run merges in the BG without blocking ongoing
> indexing (unlike the merge we do in flush, today).
> This would also allow us to separately flush thread states.  Ie, we
> need not flush all thread states at once -- we can flush one when it
> gets too big, and then let the others keep running.  This should be a
> good concurrency gain since it uses IO & CPU resources "throughout"
> indexing instead of "big burst of CPU only" then "big burst of IO
> only" that we have today (flush today "stops the world").
> One downside I can think of is... docIDs would now be "less
> monotonic": today, if N threads are indexing, you get roughly
> in-time-order assignment of docIDs.  But with this change, one
> thread state would get docIDs 0..N, the next thread state would get
> N+1..M, etc.  However, a single thread would still get
> monotonic assignment of docIDs.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
