lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Babak Farhang <farh...@gmail.com>
Subject Re: incremental document field update
Date Wed, 20 Jan 2010 03:45:18 GMT
> I see -- so your file format allows you to append to the same file
> without affecting prior readers?  We never do that in Lucene today
> (all files are "write once").

Yes. For the most part it only appends. The exception is when the
log's entry count is updated (when the appends actually "commit").
That count is written into one of many (>= 2 and <= 257) slots in round
robin fashion and finally a byte-size "keystone" is updated to determine
the active slot. The idea is that by flushing the file just before and just
after the keystone byte update we ensure persistence (assumes writing
one byte is always all-or-nothing). Increasing the number of slots
the count is written to narrows the chance of a bad read..

> Good questions (terms dict & stored fields) -- I think what we'd do is
> simply write a new segment, so stored fields, term vectors, postings,
> terms dict, etc., are all private to that new segment.  But, we'd mark
> the segment as being a delta segment, referencing the original
> segment, and we'd remap the docIDs when flushing that segment.

The .fdx and .tvx files use fixed-width tables, so it seems if there are
large gaps in the updated doc-ids, we'd have to fill in "null" rows for
those gaps.  Or do you have another data-structure in mind for these
.*x files?

One solution may be to combine the approaches we described: maintain
doc-id mappings for the .fdx and .tvx files for per-document
data at search time, and index-time mapped doc-ids for the posting
lists.

-Babak

On Tue, Jan 19, 2010 at 3:48 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> On Tue, Jan 19, 2010 at 1:32 AM, Babak Farhang <farhang@gmail.com> wrote:
>>> This is about multiple sessions with the writer.  Ie, open writer,
>>> update a few docs, close.  Do the same again, but, that 2nd session
>>> cannot overwrite the same files from the first one, since readers may
>>> have those files open.  The "write once" model gives Lucene its
>>> transactional semantics, too -- we can keep multiple commits in the
>>> index, open an arbitrary commit, rollback, etc.
>>
>> Ahh, I see. We're worried about the read-consistency of the open readers
>> while there is a concurrent writer. I think that's already taken care of with
>> the fact that under the hood we are only appending data into the logical
>> segment.  When a reader first loads one of our slave segments, it first
>> loads into memory the number of entries in the right-ahead log of mapped
>> doc-ids before reading those number of entries from the log. It could note
>> the max unmapped document number represented in the last entry and
>> discard any document numbers above that max in the postings.
>
> I see -- so your file format allows you to append to the same file
> without affecting prior readers?  We never do that in Lucene today
> (all files are "write once").
>
>>> I think both approaches are O(total size of updates) in indexing cost,
>>> assuming each writes only the new changes, to a new generation
>>> filename.
>>>
>>> But I think at search time the "remap during indexing" approach should
>>> simpler, since you "just" have to OR together N posting lists... and
>>> performance should be better (by a constant factor) since there's no
>>> remapping.
>>
>> I think I see how this approach can work for indexed-only fields. I imagine
>> we also need a parallel dictionary for these mapped postings lists in order
>> to deal with new terms encountered during the update. Not sure how this
>> would work. Can you elaborate?
>>
>> And how would we deal with updated stored fields?
>
> Good questions (terms dict & stored fields) -- I think what we'd do is
> simply write a new segment, so stored fields, term vectors, postings,
> terms dict, etc., are all private to that new segment.  But, we'd mark
> the segment as being a delta segment, referencing the original
> segment, and we'd remap the docIDs when flushing that segment.  Then
> at search time, a single SegmentReader would open its primary segment
> plus any series of delta segments, merging them.
>
> So eg StoredFieldsReader would first load the doc from the primary
> segment, then go and load any deltas, overwriting the field values.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message