lucene-dev mailing list archives

From "Jason Rutherglen (JIRA)" <>
Subject [jira] Commented: (LUCENE-2047) IndexWriter should immediately resolve deleted docs to docID in near-real-time mode
Date Mon, 16 Nov 2009 18:57:39 GMT


Jason Rutherglen commented on LUCENE-2047:

I want to replay how DW handles the updateDoc call to see if my
understanding is correct.

1: Analyzing hits an exception for a doc; its doc ID has
already been allocated, so we mark it for deletion later (on
flush?) in BufferedDeletes.

2: RAM buffer writing hits an exception. We've had updates that
marked deletes in current segments; however, they haven't been
applied yet because they're stored in BufferedDeletes docids.
They're applied on successful flush.

Are these two scenarios correct, or am I completely off
target? If correct, isn't update doc already deleting in the

bq. prefer not to add further BG threads

Maybe we can use 1.5's ReentrantReadWriteLock to effectively
allow multiple del/update doc calls to concurrently acquire the
read lock, and perform the deletes in the foreground. The write
lock could be acquired during commitDeletes, commit(), and after
a segment is flushed? I'm not sure whether it would be necessary to
acquire this write lock any time segment infos is changed.
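
As a rough sketch of the shared-lock idea above: delete/update calls take the read lock and so can run concurrently, while a commit-style operation takes the write lock and runs exclusively. The class, method, and field names here are hypothetical and are not IndexWriter internals; the sets only stand in for the real delete bookkeeping.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; not actual IndexWriter code.
class SharedDeleteLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Thread-safe set, because many threads may hold the read lock at once.
    private final Set<Integer> pending = ConcurrentHashMap.newKeySet();
    // Only touched while holding the write lock.
    private final Set<Integer> committed = new HashSet<>();

    // Foreground delete: shared (read) lock, so calls do not block each other.
    void deleteDocument(int docId) {
        lock.readLock().lock();
        try {
            pending.add(docId); // stand-in for marking the doc deleted
        } finally {
            lock.readLock().unlock();
        }
    }

    // Commit: exclusive (write) lock, waits for in-flight deletes to finish
    // and blocks new ones until the pending set has been moved over.
    int commitDeletes() {
        lock.writeLock().lock();
        try {
            committed.addAll(pending);
            pending.clear();
            return committed.size();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Which operations would actually need the write lock (commitDeletes, commit(), post-flush) is the open question raised above; the sketch only shows the read/write split.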

I think it's important to remove unnecessary global locks on
unitary operations (like deletes). We've had great results
removing these locks for isDeleted and (NIO)FSDirectory, where we
didn't think there'd be an improvement, and there was. I think
this patch (or a follow on one that implements the shared lock
solution) could effectively increase throughput (for deleting
and updating), measurably.

{quote}Lucene shouldn't aim to be able to reopen 100s of times
per second{quote}

Reopening after every doc could be a valid case that I suspect
will come up again in the future. I don't think it's too hard to

{quote} It's true that net latency of reopen will be reduced by
being incremental, but Lucene shouldn't aim to be able to reopen
100s of times per second: {quote}

Perhaps update/del throughput will increase because of the
shared lock, which would make the patch(es) worth implementing.

{quote} but I bet in practice that concurrency isn't necessary
(ie the performance of a single thread resolving all buffered
deletes is plenty fast). {quote}

We thought the same thing about the sync in FSDirectory, and it
turned out that in practice, NIOFSDir is an order of magnitude
faster on *nix machines. For NRT, every little bit of
concurrency will probably increase throughput. (i.e. most users
will have their indexes in IO cache and/or a ram dir, which
means we wouldn't be penalizing concurrency as we are today with
the global IW lock for del/up docs).

I'm going to go ahead and wrap up this patch, which will shift
deletion cost to the del/up methods (still synchronously). Then
create a separate patch that implements the shared lock

Exposing SRs for updates by the user can be done today, I'll
open a patch for this.

> IndexWriter should immediately resolve deleted docs to docID in near-real-time mode
> -----------------------------------------------------------------------------------
>                 Key: LUCENE-2047
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 3.1
>         Attachments: LUCENE-2047.patch, LUCENE-2047.patch
> Spinoff from LUCENE-1526.
> When deleteDocuments(Term) is called, we currently always buffer the
> Term and only later, when it's time to flush deletes, resolve to
> docIDs.  This is necessary because we don't in general hold
> SegmentReaders open.
> But, when IndexWriter is in NRT mode, we pool the readers, and so
> deleting in the foreground is possible.
> It's also beneficial, in that it can reduce the turnaround time when
> reopening a new NRT reader by taking this resolution off the reopen
> path.  And if multiple threads are used to do the deletion, then we
> gain concurrency, vs reopen which is not concurrent when flushing the
> deletes.
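
To make the difference described in the issue concrete, here is a toy sketch contrasting the two paths: buffering the Term and resolving it to docIDs at flush time, versus resolving it immediately when a reader is available. Everything in it is an assumption for illustration (the class name, the map of term to docIDs standing in for a pooled SegmentReader, the BitSet of deleted docs); it does not use or mirror Lucene's real classes.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Hypothetical toy model; not Lucene's actual delete handling.
class DeleteResolutionSketch {
    // term -> docIDs containing it; stands in for a pooled reader's postings.
    private final Map<String, int[]> postings;
    private final List<String> bufferedTerms = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    DeleteResolutionSketch(Map<String, int[]> postings) {
        this.postings = postings;
    }

    // Buffered path: only remember the term; no docIDs are resolved yet.
    synchronized void deleteBuffered(String term) {
        bufferedTerms.add(term);
    }

    // Flush: resolve every buffered term to docIDs in one pass.
    synchronized void flushDeletes() {
        for (String term : bufferedTerms) {
            resolve(term);
        }
        bufferedTerms.clear();
    }

    // Foreground path: resolve to docIDs at call time, so flush has less to do.
    synchronized void deleteImmediately(String term) {
        resolve(term);
    }

    private void resolve(String term) {
        int[] docs = postings.get(term);
        if (docs != null) {
            for (int doc : docs) {
                deleted.set(doc);
            }
        }
    }

    synchronized boolean isDeleted(int docId) {
        return deleted.get(docId);
    }
}
```

In this toy, the cost of resolve() moves from flushDeletes() to the deleting thread, which is the trade the issue describes for the reopen path.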

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
