lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Modifying a document by updating a payloads?
Date Thu, 31 Jul 2008 09:33:55 GMT

Antony Bowesman wrote:

> Hi Mike,
>
>> Unfortunately you will have to delete the old doc, then reindex a  
>> new doc, in order to change any payloads in the document's Tokens.
>> This issue:
>>    https://issues.apache.org/jira/browse/LUCENE-1231
>> which is still in progress, could make updating stored (but not  
>> indexed) fields a much lower cost operation, but that's not for  
>> sure and it's not clear when that issue will be done.
>
> Michael Busch's Apache Con (2006/7??) presentation summarized with  
> the bullet
>
> "Per-document Payloads – updateable"

Ahh -- this is just another name for "column-stride fields" (which is  
the above issue I linked to).

Normal payloads are per term occurrence, i.e., every position in the
document can have its own payload.

Whereas "per-document payloads" means there is a single payload per
field in the document, which is logically no different from a stored
field, except that the underlying storage would be more efficient
(column-stride, where that field's value for all docs is stored
together, vs. the normal row-stride used by current stored fields,
where all field values for a single document are stored together).
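
To make the row-stride vs. column-stride distinction concrete, here is a
minimal in-memory sketch. It is only an illustration under my own
assumptions: the class names (StrideSketch, RowStore, ColumnStore) are
hypothetical, and this is neither Lucene's API nor its on-disk format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration only; not Lucene code.
final class StrideSketch {

    /** Row-stride: all field values of one document are stored together. */
    static final class RowStore {
        private final List<Map<String, String>> docs = new ArrayList<>();

        /** Appends a document and returns its doc id. */
        int add(Map<String, String> fields) {
            docs.add(new HashMap<>(fields));
            return docs.size() - 1;
        }

        /** Reading one field means locating the whole record for that doc. */
        String get(int docId, String field) {
            return docs.get(docId).get(field);
        }
    }

    /** Column-stride: one field's value for all documents is stored together. */
    static final class ColumnStore {
        private final Map<String, String[]> columns = new HashMap<>();
        private final int maxDoc;

        ColumnStore(int maxDoc) {
            this.maxDoc = maxDoc;
        }

        /** Setting or updating a value touches a single slot in a single column. */
        void set(int docId, String field, String value) {
            columns.computeIfAbsent(field, f -> new String[maxDoc])[docId] = value;
        }

        /** Returns null if the field has no column or the doc has no value. */
        String get(int docId, String field) {
            String[] column = columns.get(field);
            return column == null ? null : column[docId];
        }
    }
}
```

In this model, changing one doc's value in ColumnStore overwrites one
array slot and leaves every other field alone, which is roughly why a
column-stride layout looks like a more natural fit for in-place updates.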

> Is making a document 'updatable' (in _some_ way) something still  
> seen as a long term goal for Lucene?

I would say it is a goal in that there is a lot of interest and
discussion around how to do this.  I think LUCENE-1231 is the most
concrete recent effort and the most likely to be the first path that
makes updating documents possible.

> As far as implementation is concerned, if a stored (not indexed)  
> field may be updatable with 1231, is there some difficulty with  
> making payloads, which from my understanding are attributed to a  
> posting of an indexed field, updatable.  I guess they ultimately  
> equate to the same thing - i.e. using a stored field to hold the  
> document's "payload", but it would be an extra field to load.

Updating the postings lists (freq/prx & payloads) is unfortunately quite
a bit trickier than updating column-stride or row-stride stored
fields.

I think the approach we need to eventually take is to allow "patches"
onto a segment's postings lists.

For example, segment _X would have the original large _X.frq/prx but
then could have, say, _X_1.frq/prx, which is a much smaller file
containing postings for those docs that have been updated since the
segment was originally created.  If more docs are updated, that would
produce _X_2.frq/prx, etc.

IndexReaders would then need to hold open all of these postings and  
dynamically "apply" the patch such that a doc's postings are iterated  
from the newest frq/prx file that it exists in.  Optimize() and  
partial optimize() would then coalesce these files back into 1 (or  
maybe a few) frq/prx files.
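
A rough sketch of that lookup rule, purely to illustrate the idea. None
of this exists in Lucene; the class PatchedPostingsSketch and its methods
are hypothetical. It models the postings of a single term as a map from
doc id to positions, with one map per "generation" (the original
_X.frq/prx first, then _X_1, _X_2, and so on).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of "patched" postings for ONE term; not Lucene code.
final class PatchedPostingsSketch {

    // generations.get(0) stands in for the original _X.frq/prx;
    // later entries stand in for the patch files _X_1, _X_2, ...
    // Each generation maps docId -> positions of the term in that doc.
    // In this sketch, an empty array in a patch means the updated doc
    // no longer contains the term.
    private final List<Map<Integer, int[]>> generations = new ArrayList<>();

    PatchedPostingsSketch(Map<Integer, int[]> original) {
        generations.add(new TreeMap<>(original));
    }

    /** Records postings for docs updated since the previous generation. */
    void addPatch(Map<Integer, int[]> updatedDocs) {
        generations.add(new TreeMap<>(updatedDocs));
    }

    /** Reads a doc's postings from the newest generation it exists in. */
    int[] positions(int docId) {
        for (int g = generations.size() - 1; g >= 0; g--) {
            int[] p = generations.get(g).get(docId);
            if (p != null) {
                return p;
            }
        }
        return null; // the doc never had postings for this term
    }

    /** Models optimize(): folds all generations back into a single one. */
    void coalesce() {
        Map<Integer, int[]> merged = new TreeMap<>();
        for (Map<Integer, int[]> gen : generations) {
            merged.putAll(gen); // oldest first, so newer entries overwrite older
        }
        generations.clear();
        generations.add(merged);
    }

    int generationCount() {
        return generations.size();
    }
}
```

A real implementation would have to do this while iterating postings in
doc id order across every term, which is where most of the difficulty
lies; the sketch only shows the "newest generation wins" rule and the
coalescing step.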

At least that's my current thinking on how we would approach updating  
postings... but realistically these are just thoughts and are quite a  
ways off from becoming a reality!

Mike