lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: incremental document field update
Date Tue, 19 Jan 2010 10:48:29 GMT
On Tue, Jan 19, 2010 at 1:32 AM, Babak Farhang <> wrote:
>> This is about multiple sessions with the writer.  Ie, open writer,
>> update a few docs, close.  Do the same again, but, that 2nd session
>> cannot overwrite the same files from the first one, since readers may
>> have those files open.  The "write once" model gives Lucene its
>> transactional semantics, too -- we can keep multiple commits in the
>> index, open an arbitrary commit, rollback, etc.
> Ahh, I see. We're worried about the read-consistency of the open readers
> while there is a concurrent writer. I think that's already taken care of with
> the fact that under the hood we are only appending data into the logical
> segment.  When a reader first loads one of our slave segments, it first
> loads into memory the number of entries in the right-ahead log of mapped
> doc-ids before reading those number of entries from the log. It could note
> the max unmapped document number represented in the last entry and
> discard any document numbers above that max in the postings.

I see -- so your file format allows you to append to the same file
without affecting prior readers?  We never do that in Lucene today
(all files are "write once").

>> I think both approaches are O(total size of updates) in indexing cost,
>> assuming each writes only the new changes, to a new generation
>> filename.
>> But I think at search time the "remap during indexing" approach should
>> simpler, since you "just" have to OR together N posting lists... and
>> performance should be better (by a constant factor) since there's no
>> remapping.
> I think I see how this approach can work for indexed-only fields. I imagine
> we also need a parallel dictionary for these mapped postings lists in order
> to deal with new terms encountered during the update. Not sure how this
> would work. Can you elaborate?
> And how would we deal with updated stored fields?

Good questions (terms dict & stored fields) -- I think what we'd do is
simply write a new segment, so stored fields, term vectors, postings,
terms dict, etc., are all private to that new segment.  But, we'd mark
the segment as being a delta segment, referencing the original
segment, and we'd remap the docIDs when flushing that segment.  Then
at search time, a single SegmentReader would open its primary segment
plus any series of delta segments, merging them.

So eg StoredFieldsReader would first load the doc from the primary
segment, then go and load any deltas, overwriting the field values.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message