lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: incremental document field update
Date Wed, 20 Jan 2010 10:41:04 GMT
On Tue, Jan 19, 2010 at 10:45 PM, Babak Farhang <> wrote:
>> I see -- so your file format allows you to append to the same file
>> without affecting prior readers?  We never do that in Lucene today
>> (all files are "write once").
> Yes. For the most part it only appends. The exception is when the
> log's entry count is updated (when the appends actually "commit").
> That count is written into one of many (>= 2 and <= 257) slots in round
> robin fashion and finally a byte-size "keystone" is updated to determine
> the active slot. The idea is that by flushing the file just before and just
> after the keystone byte update we ensure persistence (assumes writing
> one byte is always all-or-nothing). Increasing the number of slots
> the count is written to narrows the chance of a bad read..

OK; this approach (modifying an already written & possible in-use (by
an IndexReader) file) would be problematic for Lucene...

>> Good questions (terms dict & stored fields) -- I think what we'd do is
>> simply write a new segment, so stored fields, term vectors, postings,
>> terms dict, etc., are all private to that new segment.  But, we'd mark
>> the segment as being a delta segment, referencing the original
>> segment, and we'd remap the docIDs when flushing that segment.
> The .fdx and .tvx files use fixed-width tables, so it seems if there are
> large gaps in the updated doc-ids, we'd have to fill in "null" rows for
> those gaps.  Or do you have another data-structure in mind for these
> .*x files?

Yeah, good point; we'd have to allow for a sparse index storage for fdx/tvx.

> One solution may be to combine the approaches we described: maintain
> doc-id mappings for the .fdx and .tvx files for per-document
> data at search time, and index-time mapped doc-ids for the posting
> lists.

True, we could do both...


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message