lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Incremental Field Updates
Date Sun, 28 Mar 2010 23:51:20 GMT

On Mar 27, 2010, at 11:14 AM, Mark Harwood wrote:

> Of course introducing the idea of updates also introduces the notion of a primary key
and there's probably an entirely separate discussion to be had around user-supplied vs Lucene-generated
keys.

Not sure I see that need.  Can you explain your reasoning a bit more?

> 
> That aside, the biggest concern for me here is the impact that this is likely to have
on search -  currently queries such as "a:1 AND b:2" are streamed efficiently when evaluated
because fields a and b have long postings lists conveniently sorted in doc-id insertion order
that can be walked in sequence. If there are to be disjoint, partial  docs, with updated contents
arriving out-of-primary-key-order this is bound to introduce costly disk seeks to the query
process or require commit-time merges/sorts to preserve the doc-ordered posting lists needed
to maintain search speed. Both of these strategies come at a reasonable cost. Of course some
form of RAM-based value caching (allowing us to randomly look up the latest value for field
b in doc x) is fast but probably only suited to small-scale deployments.

Indeed, part of me thinks this is especially suited for flex indexing, where I can make a
design time decision to pay the cost in exchange for high updates at the cost of potentially
slower search.


> 
> It's probably worth thinking through the scenarios we want to cater for. Maybe a Digg-like
scenario with users voting on document popularity *can* be catered for with RAM-based field
caches because the data (count of votes) is small enough to cache? 

Agreed.  Many social applications require updating one or two fields very frequently (popularity,
ratings, votes, etc.)


> 
> Cheers,
> Mark
> 
> 
> On 27 Mar 2010, at 11:25, Grant Ingersoll wrote:
> 
>> First off, this is something I've had in my head for a long time, but don't have
any code.
>> 
>> As many of you know, one of the main things that vexes any search engine based on
an inverted index is how to do fast updates of just one field w/o having to delete and re-add
the whole document like we do today.   When I think about the whole update problem, I keep
coming back to the notion of Photoshop (or any other real photo editing solution) Layers.
 In a photo editing solution, when you want to hide/change a piece of a photo, it is considered
best practice to add a layer over that part of the photo to be changed.  This way, the original
photo is maintained and you don't have to worry about accidentally damaging the area you aren't
interested in.  Thus, a layer is essentially a mask on the original photo. The analogy isn't
quite the same here, but nevertheless...
>> So, thinking out loud here and I'm not sure on the best wording of this: 
>> 
>> When a document first comes in, it is all in one place, just as it is now. Then,
when an update comes in on a particular field, we somehow mark in the index that the document
in question is modified and then we add the new change onto the end of the index (just like
we currently do when adding new docs, but this time it's just a doc w/ a single field). Then,
when searching, we would, when scoring the affected documents, go to a secondary process that
knew where to look up the incremental changes. As background merging takes place, these "disjoint"
documents would be merged back together. We'd maybe even consider a "high update" merge scheduler
that could more frequently handle these incremental merges.   
>> 
>> 
>> I'm not sure where we would maintain the list of changes.  That is, is it something
that goes in the posting list, or is it a side structure.  I think in the posting list would
be to slow.  Also, perhaps it is worthwhile for people to indicate that a particular field
is expected to be updated while others maintain their current format so as not to incur the
penalty on each.
>>  In a sense, the old field for that document is masked by the new field. I think,
given proper index structure, that we maybe could make that marking of the old field fast
(maybe it's a pointer to the new field, maybe it's just a bit indicating to go look in the
"update" segment)
>> 
>> On the search side, I think performance would still be maintained b/c even in high
update envs. you aren't usually talking about more than a few thousand changes in a minute
or two and the background merger would be responsible for keeping the total number of disjoint
documents low.
>> 
>> I realize there isn't a whole lot to go on here just yet, but perhaps it will spawn
some questions/ideas that will help us work it out in a better way.
>> 
>> At any rate, I think adding incr. field update capability would be a huge win for
Lucene.
>> 
>> -Grant
> 



Mime
View raw message