lucene-java-user mailing list archives

From Babak Farhang <farh...@gmail.com>
Subject Re: incremental document field update
Date Mon, 18 Jan 2010 05:35:11 GMT
> Got it.  You'd presumably have to add a generation to this file, so
> that multiple sessions of updates + commit write to private files
> ("write once")?  And then the reader merges all of them.

Actually, I hadn't considered concurrent update/commit semantics;
I was thinking more along the lines of a single-writer / many-readers
model. Concurrent update is an altogether different beast from
concurrent add, and I was hoping to avoid that complexity.


> of RAM, especially the Integer overhead.  Though, these will be sparse
> right?  So it only uses RAM for docs that have updates, and when
> segment merging hasn't merged them away.

Yes.

> It sounds like this'd require the full posting list for a given term X
> to be loaded into RAM, mapped, and sorted, before the caller can
> actually pull the first docID, right?  I guess often this is likely an
> acceptable cost, since "usually" the fields you update are smallish.

I was thinking that, for large posting lists, we can avoid loading the
entire list into RAM in most cases. Given that we have a sorted set of
mapped surrogate doc-ids (the key-set in the TreeMap), we should be able
to locate any surrogate doc-ids that need to be mapped in the underlying
posting list using judicious calls to skipTo() on the underlying TermDocs.
(This is just like an efficient implementation of finding the intersection
of two sorted arrays of integers: the cost is linear in the size of the
shorter array, and often not all the elements of the shorter array even
need to be examined.)
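
Roughly, a sketch of that skipTo()-driven intersection (only TermDocs is
actual Lucene API; the helper class and its names are made up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.TermDocs;

class SurrogateIntersector {
  /** targets must iterate in ascending order, e.g. a TreeMap key-set. */
  static List<Integer> intersect(Iterable<Integer> targets, TermDocs termDocs)
      throws IOException {
    List<Integer> hits = new ArrayList<Integer>();
    int doc = -1;
    for (int target : targets) {
      if (doc < target) {              // skipTo() moves past the current entry,
        if (!termDocs.skipTo(target))  // so only call it when we must advance
          break;                       // posting list exhausted
        doc = termDocs.doc();          // now at the first doc >= target
      }
      if (doc == target)
        hits.add(target);              // surrogate id occurs in the postings
    }
    return hits;
  }
}

The doc < target guard is there because skipTo() skips to the first entry
beyond the current one, so re-invoking it for a target we are already at
or past would step over a posting.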

So when the [Filter]TermDocs instance is first positioned to a term, it
gathers the to-be-mapped (surrogate) doc-ids in a sorted list (hopefully
a small subset of the postings list), then maps these to their "view" ids
in a separate list, and finally sorts this list of mapped "view" ids. So
one list contains the (sorted) doc-ids that must be removed from the
underlying posting list, and another contains the (sorted) doc-ids that
must be interleaved into the postings.  Using these two lists, we should
be able to implement a proper view of the postings.
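
Continuing the sketch, the two lists might be gathered like so at
positioning time (surrogateToView stands in for the TreeMap built from
the log file; none of this is existing Lucene API):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

import org.apache.lucene.index.TermDocs;

class MappedPostings {
  final List<Integer> removals = new ArrayList<Integer>();    // sorted surrogate ids to hide
  final List<Integer> interleaved = new ArrayList<Integer>(); // sorted "view" ids to splice in

  MappedPostings(TreeMap<Integer,Integer> surrogateToView, TermDocs termDocs)
      throws IOException {
    // Only the surrogate ids actually present in this term's postings
    // matter; keySet() iterates in ascending order, as intersect() needs.
    for (int surrogate
        : SurrogateIntersector.intersect(surrogateToView.keySet(), termDocs)) {
      removals.add(surrogate);                         // already in sorted order
      interleaved.add(surrogateToView.get(surrogate)); // its mapped "view" id
    }
    Collections.sort(interleaved); // view ids need not come out in key order
  }
}

The actual [Filter]TermDocs would then walk the underlying postings,
suppressing ids found in removals and merging in ids from interleaved.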


> Thinking out loud... I wonder if, instead of deferring the mapping
> until search time (which is costly since every search must re-do it),
> we could do the mapping at indexing time?
>
> Say we add an IndexWriter.updateDocumentFields(Term, Document) method,

This method name/signature/semantics is much better than
the one I proposed.

> which will update only the specified fields.  The Term must uniquely
> refer to 1 existing doc in the index.  On flush, we resolve that Term
> -> docID (we do this already, to apply deletions), and then we write
> to a new set of "delta postings" files using the correct docID.  Then
> the reader would have to merge these postings files when reading.

[Thinking out loud too] Interesting. So instead of rewriting the entire
slave segment on update, we rewrite things like the "delta postings"
files on committing an update?

I'm thinking this approach would probably work best for "batch"
update/commits.

If, on the other hand, the use case is a sequence of N single-doc
update/commits, then the total cost of the N updates would likely
approach O(N**2) -- <N rewrites> * <avg delta postings file size = O(N)>.
So as ever, there are tradeoffs.
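
(Spelling out that arithmetic: if the i-th commit rewrites a delta
postings file that has accumulated roughly i docs' worth of updates, the
total work is about 1 + 2 + ... + N = N(N+1)/2 postings written, i.e.
O(N**2), versus O(N) when the N updates land in one batched commit.)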

-Babak



On Sun, Jan 17, 2010 at 6:39 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> On Sun, Jan 17, 2010 at 7:45 AM, Babak Farhang <farhang@gmail.com> wrote:
>>> So the idea is, I can change the field for only a few docs in a
>>> massive index, and the amount of "work" done, and bytes written, is in
>>> proportion only to how many docs were changed?
>>
>> Exactly. We append auxiliary data to the parallel segment and
>> delay rewriting the segment to when it'll be merged with another segment.
>
> OK.
>
>>> What would the actual data structure look like on disk?
>>
>> Something like a surrogate docs log file
>>
>> <rowCount><<int><int>>**rowCount
>>
>> a fixed-width table, each row mapping a deleted doc-id to its "surrogate"
>> doc-id. The mapping entries are not sorted; it's a log file. If two
>> entries reference the same deleted doc-id, the last one wins. The
>> <rowCount> header makes the structure self-delimiting and ensures clean
>> writes (perhaps using a byte-size keystone).
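
For concreteness, a minimal sketch of writing that log, using plain
java.io rather than Lucene's Directory API (all names here are made up):

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;

class SurrogateLogWriter {
  /** rows[i] = { deletedDocId, surrogateDocId }; on replay the last wins. */
  static void write(String path, List<int[]> rows) throws IOException {
    DataOutputStream out = new DataOutputStream(new FileOutputStream(path));
    try {
      out.writeInt(rows.size()); // <rowCount> header: file is self-delimiting
      for (int[] row : rows) {
        out.writeInt(row[0]);    // deleted doc-id
        out.writeInt(row[1]);    // its "surrogate" doc-id
      }
    } finally {
      out.close();
    }
  }
}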
>
> Got it.  You'd presumably have to add a generation to this file, so
> that multiple sessions of updates + commit write to private files
> ("write once")?  And then the reader merges all of them.
>
>>> And what
>>> would be loaded into RAM such that eg enumerating a postings lists
>>> would work correctly?  You'd have to remap docIDs on the fly, right?
>>
>> Yes, on the fly. I think it boils down to 3 tasks:
>>
>> 1. Map "view" doc-ids to surrogate doc-ids. Supports input, e.g. referring
>>  to a doc by id.
>> 2. Map surrogate doc-ids to "view" doc-ids. Supports output.
>> 3. Re-sort postings quickly on output, and provide correct
>> TermDocs.skipTo() semantics.
>>
>> For (1), I think an unordered HashMap<Integer,Integer> is good enough.
>> For (2) and (3), something like a TreeMap<Integer,Integer> I imagine
>> should be good enough too.  The sorted order of its keys (surrogate
>> doc-ids) should also allow us to intersect the set of surrogate docs
>> quickly with the already sorted outbound postings data, for example.
>
> OK though for an index with many documents, that's gonna use up a lot
> of RAM, especially the Integer overhead.  Though, these will be sparse
> right?  So it only uses RAM for docs that have updates, and when
> segment merging hasn't merged them away.
>
>> Both maps would be built by replaying the surrogate docs log file.
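
A sketch of that replay, building the maps for tasks (1)-(3) above
(again plain java.io and made-up names):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class SurrogateMaps {
  // task (1): "view" doc-id -> surrogate doc-id
  final Map<Integer,Integer> viewToSurrogate = new HashMap<Integer,Integer>();
  // tasks (2) and (3): surrogate doc-id -> "view" doc-id, sorted by key
  final TreeMap<Integer,Integer> surrogateToView = new TreeMap<Integer,Integer>();

  SurrogateMaps(String path) throws IOException {
    DataInputStream in = new DataInputStream(new FileInputStream(path));
    try {
      int rowCount = in.readInt();
      for (int i = 0; i < rowCount; i++) {
        int viewId = in.readInt();      // the deleted ("view") doc-id
        int surrogateId = in.readInt(); // its surrogate doc-id
        // replaying in log order makes a later entry for the same view
        // id overwrite the earlier one: last one wins
        Integer stale = viewToSurrogate.put(viewId, surrogateId);
        if (stale != null)
          surrogateToView.remove(stale); // drop the superseded reverse entry
        surrogateToView.put(surrogateId, viewId);
      }
    } finally {
      in.close();
    }
  }
}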
>
> It sounds like this'd require the full posting list for a given term X
> to be loaded into RAM, mapped, and sorted, before the caller can
> actually pull the first docID, right?  I guess often this is likely an
> acceptable cost, since "usually" the fields you update are smallish.
>
> Thinking out loud... I wonder if, instead of deferring the mapping
> until search time (which is costly since every search must re-do it),
> we could do the mapping at indexing time?
>
> Say we add an IndexWriter.updateDocumentFields(Term, Document) method,
> which will update only the specified fields.  The Term must uniquely
> refer to 1 existing doc in the index.  On flush, we resolve that Term
> -> docID (we do this already, to apply deletions), and then we write
> to a new set of "delta postings" files using the correct docID.  Then
> the reader would have to merge these postings files when reading.
> We'd also have generations...
>
>> Maybe I should first try to bang out a prototype that works at the
>> index level, not the segment level?
>
> Sounds great!
>
> Mike
>
