lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Incremental Field Updates
Date Sat, 27 Mar 2010 11:25:08 GMT
First off, this is something I've had in my head for a long time, but don't have any code.

As many of you know, one of the main things that vexes any search engine based on an inverted
index is how to do fast updates of just one field w/o having to delete and re-add the whole
document like we do today.   When I think about the whole update problem, I keep coming back
to the notion of Photoshop (or any other real photo editing solution) Layers.  In a photo
editing solution, when you want to hide/change a piece of a photo, it is considered best practice
to add a layer over that part of the photo to be changed.  This way, the original photo is
maintained and you don't have to worry about accidentally damaging the area you aren't interested
in.  Thus, a layer is essentially a mask on the original photo. The analogy isn't quite the
same here, but nevertheless...
So, thinking out loud here and I'm not sure on the best wording of this: 

When a document first comes in, it is all in one place, just as it is now. Then, when an update
comes in on a particular field, we somehow mark in the index that the document in question
is modified and then we add the new change onto the end of the index (just like we currently
do when adding new docs, but this time it's just a doc w/ a single field). Then, when searching,
we would, when scoring the affected documents, go to a secondary process that knew where to
look up the incremental changes. As background merging takes place, these "disjoint" documents
would be merged back together. We'd maybe even consider a "high update" merge scheduler that
could more frequently handle these incremental merges.   


I'm not sure where we would maintain the list of changes.  That is, is it something that goes
in the posting list, or is it a side structure.  I think in the posting list would be to slow.
 Also, perhaps it is worthwhile for people to indicate that a particular field is expected
to be updated while others maintain their current format so as not to incur the penalty on
each.
 In a sense, the old field for that document is masked by the new field. I think, given proper
index structure, that we maybe could make that marking of the old field fast (maybe it's a
pointer to the new field, maybe it's just a bit indicating to go look in the "update" segment)

On the search side, I think performance would still be maintained b/c even in high update
envs. you aren't usually talking about more than a few thousand changes in a minute or two
and the background merger would be responsible for keeping the total number of disjoint documents
low.

I realize there isn't a whole lot to go on here just yet, but perhaps it will spawn some questions/ideas
that will help us work it out in a better way.

At any rate, I think adding incr. field update capability would be a huge win for Lucene.

-Grant
Mime
View raw message