lucene-dev mailing list archives

From Babak Farhang <farh...@gmail.com>
Subject Re: Incremental Field Updates
Date Fri, 02 Apr 2010 06:50:13 GMT
[Late to this party, but thought I'd chime in]

I think this "layer" concept is right on.  But I'm wondering about the
life cycle of these layers.  Do layers live forever? Or do they
collapse at some point? (Much like, as I think was already pointed out,
deletes are collapsed today when segments are merged.)

-Babak

On Sat, Mar 27, 2010 at 5:25 AM, Grant Ingersoll <gsingers@apache.org> wrote:
> First off, this is something I've had in my head for a long time, but don't
> have any code.
> As many of you know, one of the main things that vexes any search engine
> based on an inverted index is how to do fast updates of just one field w/o
> having to delete and re-add the whole document like we do today.   When I
> think about the whole update problem, I keep coming back to the notion of
> Photoshop (or any other real photo editing solution) Layers.  In a photo
> editing solution, when you want to hide/change a piece of a photo, it is
> considered best practice to add a layer over that part of the photo to be
> changed.  This way, the original photo is maintained and you don't have to
> worry about accidentally damaging the area you aren't interested in.  Thus,
> a layer is essentially a mask on the original photo. The analogy isn't quite
> the same here, but nevertheless...
>
> So, thinking out loud here and I'm not sure on the best wording of this:
>
> When a document first comes in, it is all in one place, just as it is now.
> Then, when an update comes in on a particular field, we somehow mark in the
> index that the document in question is modified and then we add the new
> change onto the end of the index (just like we currently do when adding new
> docs, but this time it's just a doc w/ a single field). Then, when
> searching, we would, when scoring the affected documents, go to a secondary
> process that knew where to look up the incremental changes. As background
> merging takes place, these "disjoint" documents would be merged back
> together. We'd maybe even consider a "high update" merge scheduler that
> could more frequently handle these incremental merges.
>
> I'm not sure where we would maintain the list of changes.  That is, is it
> something that goes in the posting list, or is it a side structure?  I think
> keeping it in the posting list would be too slow.  Also, perhaps it is worthwhile for
> people to indicate that a particular field is expected to be updated while
> others maintain their current format so as not to incur the penalty on each.
>
>  In a sense, the old field for that document is masked by the new field. I
> think, given proper index structure, that we maybe could make that marking
> of the old field fast (maybe it's a pointer to the new field, maybe it's
> just a bit indicating to go look in the "update" segment)
>
> On the search side, I think performance would still be maintained b/c even
> in high update envs. you aren't usually talking about more than a few
> thousand changes in a minute or two and the background merger would be
> responsible for keeping the total number of disjoint documents low.
>
> I realize there isn't a whole lot to go on here just yet, but perhaps it
> will spawn some questions/ideas that will help us work it out in a better
> way.
> At any rate, I think adding incr. field update capability would be a huge
> win for Lucene.
> -Grant
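
To make the layer idea and the "collapse" question a bit more concrete, here
is a minimal in-memory sketch. It is not Lucene code and uses no Lucene API:
the class LayeredFieldStore and all of its methods are hypothetical names made
up for illustration, and it assumes plain string field values with no posting
lists or scoring. Each field update is appended as a new layer, lookups check
the newest layer first, and merge() folds every layer back into the base,
roughly the way deletes are applied at segment merge time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of layered field updates; not part of Lucene. */
public class LayeredFieldStore {
    // Base layer: docId -> (field -> value), written when the doc is first added.
    private final Map<Integer, Map<String, String>> base = new HashMap<>();
    // Update layers, oldest first; each holds only the fields that changed.
    private final List<Map<Integer, Map<String, String>>> layers = new ArrayList<>();

    public void addDocument(int docId, Map<String, String> fields) {
        base.put(docId, new HashMap<>(fields));
    }

    /** Appends a single-field update as a new layer; the base is left untouched. */
    public void updateField(int docId, String field, String value) {
        Map<Integer, Map<String, String>> layer = new HashMap<>();
        layer.computeIfAbsent(docId, k -> new HashMap<>()).put(field, value);
        layers.add(layer);
    }

    /** Reads a field: the newest layer that mentions it masks everything older. */
    public String get(int docId, String field) {
        for (int i = layers.size() - 1; i >= 0; i--) {
            Map<String, String> doc = layers.get(i).get(docId);
            if (doc != null && doc.containsKey(field)) {
                return doc.get(field);
            }
        }
        Map<String, String> doc = base.get(docId);
        return doc == null ? null : doc.get(field);
    }

    /** Collapses all layers into the base, oldest first so the newest value wins. */
    public void merge() {
        for (Map<Integer, Map<String, String>> layer : layers) {
            for (Map.Entry<Integer, Map<String, String>> e : layer.entrySet()) {
                base.computeIfAbsent(e.getKey(), k -> new HashMap<>()).putAll(e.getValue());
            }
        }
        layers.clear();
    }

    /** Number of layers not yet merged; a merge policy could watch this. */
    public int layerCount() {
        return layers.size();
    }
}
```

In this sketch the read cost grows with the number of unmerged layers, which
is why something like the "high update" merge scheduler mentioned above would
want to keep layerCount() small.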


