lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Babak Farhang <farh...@gmail.com>
Subject Re: incremental document field update
Date Fri, 22 Jan 2010 06:39:37 GMT
> OK; this approach (modifying an already written & possible in-use (by
> an IndexReader) file) would be problematic for Lucene...

If you have N slots, there would have to be N-1 commits + an Nth commit
in progress while reading the "entry-count block"  for there to be the
possibility of
a bad read. Make N large enough (max 256), and that should close the
window, I think.

Any way, just want to thank you Mike for sharing your thoughts and
ideas. Time to
try some of them..

Cheers,
-Babak

On Wed, Jan 20, 2010 at 3:41 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> On Tue, Jan 19, 2010 at 10:45 PM, Babak Farhang <farhang@gmail.com> wrote:
>>> I see -- so your file format allows you to append to the same file
>>> without affecting prior readers?  We never do that in Lucene today
>>> (all files are "write once").
>>
>> Yes. For the most part it only appends. The exception is when the
>> log's entry count is updated (when the appends actually "commit").
>> That count is written into one of many (>= 2 and <= 257) slots in round
>> robin fashion and finally a byte-size "keystone" is updated to determine
>> the active slot. The idea is that by flushing the file just before and just
>> after the keystone byte update we ensure persistence (assumes writing
>> one byte is always all-or-nothing). Increasing the number of slots
>> the count is written to narrows the chance of a bad read..
>
> OK; this approach (modifying an already written & possible in-use (by
> an IndexReader) file) would be problematic for Lucene...
>
>>> Good questions (terms dict & stored fields) -- I think what we'd do is
>>> simply write a new segment, so stored fields, term vectors, postings,
>>> terms dict, etc., are all private to that new segment.  But, we'd mark
>>> the segment as being a delta segment, referencing the original
>>> segment, and we'd remap the docIDs when flushing that segment.
>>
>> The .fdx and .tvx files use fixed-width tables, so it seems if there are
>> large gaps in the updated doc-ids, we'd have to fill in "null" rows for
>> those gaps.  Or do you have another data-structure in mind for these
>> .*x files?
>
> Yeah, good point; we'd have to allow for a sparse index storage for fdx/tvx.
>
>> One solution may be to combine the approaches we described: maintain
>> doc-id mappings for the .fdx and .tvx files for per-document
>> data at search time, and index-time mapped doc-ids for the posting
>> lists.
>
> True, we could do both...
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message