lucene-dev mailing list archives

From "Michael Busch (JIRA)" <>
Subject [jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
Date Thu, 22 Apr 2010 00:01:54 GMT


Michael Busch commented on LUCENE-2324:

> The deletes aren't entirely in the foreground, only the RAM
> buffer deletes. Deletes to existing segments would use the
> existing clone and delete mechanism.

I think deletes should all be based on the new sequenceIDs. Today the buffered
deletes depend on a value that can change (numDocs changes when segments are
merged). But with sequenceIDs they're absolute: each delete operation will
have a sequenceID assigned, and the ordering of all write operations (add,
update, delete) is unambiguously defined by the sequenceIDs. Remember that
addDocument/updateDocument/deleteDocuments will all return the sequenceID.
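A minimal sketch of the scheme described above, assuming a simple atomic counter as the sequenceID source (class and method names here are hypothetical, not actual Lucene API):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: every write operation (add, update, delete) is
// assigned a monotonically increasing sequenceID, which defines an
// unambiguous total order over all writes and is returned to the caller.
class SketchWriter {
    private final AtomicLong nextSeqID = new AtomicLong(0);

    long addDocument(Object doc) {
        long seqID = nextSeqID.incrementAndGet();
        // ... buffer the doc in a DocumentsWriterPerThread, tagged with seqID ...
        return seqID;  // caller can use the returned ID for tracking
    }

    long deleteDocuments(Object query) {
        long seqID = nextSeqID.incrementAndGet();
        // ... record the buffered delete keyed by seqID ...
        return seqID;
    }
}
```

Because the counter is global across all DWPTs, sequenceIDs from different per-thread segments interleave, which is exactly why ordering questions below are decided by comparing sequenceIDs rather than segment membership.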

This means that we don't have to remap deletes anymore. Also, the pooled
SegmentReaders that read the flushed segments can use arrays of sequenceIDs
instead of BitSets. Of course that needs more memory, but even if you add 1M
docs without ever calling IW.commit/close you only need 8MB (one long per
doc) - I think that's acceptable. And this size is independent of how many
times you call reopen/clone on the realtime readers, because they can all
share the same deletes array.
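A sketch of how a shared per-doc sequenceID array could replace a per-reader BitSet (the class names and the 0-means-live convention are assumptions for illustration):

```java
// Hypothetical sketch: deleteSeqIDs[docID] holds the sequenceID of the
// delete that hit that doc, or 0 if the doc was never deleted. All reader
// clones share this one array; each reader snapshot only remembers the
// highest sequenceID it is allowed to "see".
class SharedDeletes {
    final long[] deleteSeqIDs;  // one long per doc: 1M docs -> 8 MB

    SharedDeletes(int maxDoc) {
        this.deleteSeqIDs = new long[maxDoc];
    }

    void markDeleted(int docID, long seqID) {
        deleteSeqIDs[docID] = seqID;
    }
}

class ReaderSnapshot {
    private final SharedDeletes deletes;
    private final long visibleUpToSeqID;  // snapshot point of this (re)open

    ReaderSnapshot(SharedDeletes deletes, long visibleUpToSeqID) {
        this.deletes = deletes;
        this.visibleUpToSeqID = visibleUpToSeqID;
    }

    // A doc is deleted for this reader only if the delete happened
    // at or before the reader's snapshot sequenceID.
    boolean isDeleted(int docID) {
        long delSeqID = deletes.deleteSeqIDs[docID];
        return delSeqID != 0 && delSeqID <= visibleUpToSeqID;
    }
}
```

This is why reopen/clone adds no memory: an older snapshot simply ignores deletes whose sequenceID is newer than its own snapshot point.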

We can also modify IW.commit/close to return the latest sequenceID. This would
be very nice for document tracking. E.g. when you hit an aborting exception
after some segments have already been flushed, then even though you must
discard everything in the DW's buffer, you can still call commit/close to
commit everything that doesn't have to be discarded. IW.commit/close would in
that case return the sequenceID of the latest write operation that was
successfully committed, i.e. that would be visible to an IndexReader. Though
we have to be careful here: multiple segments can have interleaving
sequenceIDs, so we must discard every segment that contains one or more
sequenceIDs greater than the lowest one in the DW. So we still need the
push-deletes logic, which keeps RAM deletes separate from the flushed ones
until flushing was successful.
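The discard rule above can be sketched as follows, assuming each flushed segment records the min/max sequenceID it contains (the method name and the range representation are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the discard rule on an aborting exception:
// a flushed segment must be discarded if it contains any sequenceID
// greater than the lowest sequenceID still buffered in the DW, because
// sequenceIDs of different per-thread segments can interleave.
class DiscardRule {
    // each range = {minSeqID, maxSeqID} of one flushed segment
    static List<long[]> committableSegments(List<long[]> flushedRanges,
                                            long lowestBufferedSeqID) {
        List<long[]> keep = new ArrayList<>();
        for (long[] range : flushedRanges) {
            // keep only segments whose writes all precede everything lost;
            // a buffered seqID can never appear in a flushed segment, so
            // strict comparison is sufficient
            if (range[1] < lowestBufferedSeqID) {
                keep.add(range);
            }
        }
        return keep;
    }
}
```

Note that a segment with range {3,6} must go if seqID 4 was still buffered, even though most of its writes succeeded - committing it would make a later write visible while an earlier one is lost.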

DW/IW need to keep track of the largest sequenceID that is *safe*, i.e. that
could be committed even if DW hits an aborting exception. Some invariants: 
 * All deletes with a sequenceID smaller than or equal to safeSeqID have
already been applied to the deletes arrays of the flushed segments. 
 * All deletes with a sequenceID greater than safeSeqID are still buffered
in deletesInRAM. 
 * safeSeqID is always smaller than the sequenceID of any buffered doc or
buffered delete in DW or DWPT. 
 * safeSeqID is always equal to the maximum sequenceID of one, and only one,
flushed segment.
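The first two invariants can be sketched as a check, assuming deletes are partitioned into those already applied to the flushed segments' deletes arrays and those still in deletesInRAM (the method and parameter names are made up for illustration):

```java
// Hypothetical sketch: verify the delete-related invariants for a given
// safeSeqID. "applied" = pushed to the flushed segments' deletes arrays,
// "inRAM" = still buffered in deletesInRAM.
class SafeSeqIDInvariants {
    static boolean hold(long safeSeqID, long[] appliedSeqIDs, long[] inRAMSeqIDs) {
        for (long seqID : appliedSeqIDs) {
            if (seqID > safeSeqID) return false;   // applied delete is not yet safe
        }
        for (long seqID : inRAMSeqIDs) {
            if (seqID <= safeSeqID) return false;  // "safe" delete still only in RAM
        }
        return true;
    }
}
```

Together the invariants pin safeSeqID down exactly: it is the boundary below which every write is durable in a flushed segment and above which everything is still (re)coverable only from RAM.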

When IW.close/commit is called, safeSeqID is returned. If no aborting
exception occurred it equals the highest sequenceID ever assigned (during that
IW "session"). In any case it's always the sequenceID of the latest write
operation that an IndexReader opened after IW.close/commit will "see".

This would be nice for apps to track which docs made it successfully into
the index. Apps can then keep an external log to figure out what they have to
replay in case of exceptions.
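An app-side replay log could look roughly like this (the WriteLog class is hypothetical; it only assumes that commit/close returns safeSeqID as proposed above):

```java
import java.util.Collection;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: the app records the sequenceID returned by each
// write call. After commit/close returns safeSeqID, every logged write
// with a larger sequenceID was lost and must be replayed.
class WriteLog {
    private final NavigableMap<Long, String> pending = new TreeMap<>();

    void record(long seqID, String externalDocID) {
        pending.put(seqID, externalDocID);
    }

    // doc IDs whose writes were not committed and need to be re-indexed
    Collection<String> toReplay(long committedSafeSeqID) {
        // strictly greater than safeSeqID: those writes are not visible
        return pending.tailMap(committedSafeSeqID, false).values();
    }
}
```

Because sequenceIDs totally order the writes, the app can replay the returned docs in iteration order and end up with exactly the index state it originally intended.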

Does all this make sense? :) This is very complex stuff, I wouldn't be
surprised if there's something I didn't think about.

> Per thread DocumentsWriters that write their own private segments
> -----------------------------------------------------------------
>                 Key: LUCENE-2324
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>         Attachments: lucene-2324.patch, LUCENE-2324.patch
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

