lucene-dev mailing list archives

From "Sivan Yogev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments
Date Tue, 07 Aug 2012 12:51:09 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13430322#comment-13430322 ]

Sivan Yogev commented on LUCENE-4258:
-------------------------------------

Working through the details, it seems that we need to add a new layer of information for
stacked segments. For each field that was added with REPLACE_FIELDS, we need to record the
documents in which a replacement took place, together with the number of the latest generation
that contained the replacement. Call this list the "generation vector". The TermDocs provided
by StackedSegmentReader for a given term is then a special merge of that term's TermDocs across
all stacked segments. The "special" part is that we ignore occurrences from documents in which
the term's field was replaced in a later generation.

An example. Assume we have doc 1 with title "I love bananas" and doc 2 with title "I love
oranges", and the segment is flushed. We will have the following base segment (ignoring positions):

bananas: doc 1
I: doc 1, doc 2
love: doc 1, doc 2
oranges: doc 2

Now we add to doc 1 an additional title field "I hate apples", replace the title of doc 2
with "I love lemons", and flush. We will have the following segment for generation 1:

apples: doc 1
hate: doc 1
I: doc 1, doc 2
lemons: doc 2
love: doc 2
generation vector for field "title": (doc 2, generation 1)

TermDocs for a few terms:
* title:bananas : {1}. Uses the TermDocs of the base segment. Doc 1 has no entry in the title
generation vector, so it is not affected.
* title:oranges : {}. The term appears only in the base segment (generation 0), but the title
of doc 2 was replaced in generation 1, so its occurrences from generations < 1 are ignored.
* title:lemons : {2}. Uses the TermDocs of generation 1. Occurrences in doc 2's title from
generations < 1 are ignored, but the term appears in generation 1.
* title:love : {1,2}. Uses the TermDocs of both segments. Doc 1 comes from the base segment,
since its title was never replaced. The base-segment occurrence in doc 2 is ignored, but the
term appears again in doc 2 in generation 1.
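
To make the merge rule concrete, here is a minimal sketch in plain Java that reproduces the
four results above. It is only an illustration under assumptions: the class and method names
are hypothetical, postings are held in in-memory maps rather than real TermDocs, and a doc
with no entry in the generation vector is treated as "replaced in generation 0".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch of the "special merge"; none of these names are Lucene APIs. */
final class StackedTermDocsSketch {

    /**
     * postingsByGeneration.get(g) maps a term to its doc IDs in generation g (0 = base segment).
     * generationVector maps a doc ID to the latest generation in which the field was replaced;
     * docs whose field was never replaced have no entry.
     */
    static List<Integer> termDocs(String term,
                                  List<Map<String, List<Integer>>> postingsByGeneration,
                                  Map<Integer, Integer> generationVector) {
        TreeSet<Integer> result = new TreeSet<>();
        for (int gen = 0; gen < postingsByGeneration.size(); gen++) {
            List<Integer> docs = postingsByGeneration.get(gen).getOrDefault(term, List.of());
            for (int doc : docs) {
                int replacedIn = generationVector.getOrDefault(doc, 0);
                // Ignore occurrences that predate the latest replacement of this doc's field.
                if (gen >= replacedIn) {
                    result.add(doc);
                }
            }
        }
        return new ArrayList<>(result);
    }

    /** The two segments from the example: base segment, then generation 1. */
    static List<Map<String, List<Integer>>> exampleSegments() {
        Map<String, List<Integer>> base = new HashMap<>();
        base.put("bananas", List.of(1));
        base.put("I", List.of(1, 2));
        base.put("love", List.of(1, 2));
        base.put("oranges", List.of(2));

        Map<String, List<Integer>> gen1 = new HashMap<>();
        gen1.put("apples", List.of(1));
        gen1.put("hate", List.of(1));
        gen1.put("I", List.of(1, 2));
        gen1.put("lemons", List.of(2));
        gen1.put("love", List.of(2));

        List<Map<String, List<Integer>>> segments = new ArrayList<>();
        segments.add(base);
        segments.add(gen1);
        return segments;
    }

    /** Generation vector for field "title": (doc 2, generation 1). */
    static Map<Integer, Integer> exampleGenerationVector() {
        Map<Integer, Integer> vector = new HashMap<>();
        vector.put(2, 1);
        return vector;
    }

    public static void main(String[] args) {
        List<Map<String, List<Integer>>> segments = exampleSegments();
        Map<Integer, Integer> vector = exampleGenerationVector();
        for (String term : List.of("bananas", "oranges", "lemons", "love")) {
            System.out.println("title:" + term + " : " + termDocs(term, segments, vector));
        }
    }
}
```

A real reader would presumably merge sorted postings lazily instead of collecting them into a
set; the sketch only shows the filtering condition.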

I propose to initially use PackedInts for the generation vector, since we know how many generations
the current segment has upon flushing (and therefore how many bits each entry needs). Later we
might consider special treatment for sparse vectors.
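
To illustrate why knowing the generation count at flush time helps, here is a rough sketch
of a dense, fixed-width packed vector in plain Java. It deliberately does not use Lucene's
PackedInts classes; the class name and methods are hypothetical, and the layout (values
packed back to back into 64-bit blocks) is just one possible choice.

```java
/** Hypothetical dense generation vector: one fixed-width entry per doc, 0 = never replaced. */
final class PackedGenerationVectorSketch {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedGenerationVectorSketch(int docCount, int maxGeneration) {
        this.bitsPerValue = bitsRequired(maxGeneration);
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) docCount * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) / 64)];
    }

    /** Smallest number of bits (at least 1) that can hold every value in 0..maxValue. */
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    void set(int doc, int generation) {
        if (generation < 0 || generation > mask) {
            throw new IllegalArgumentException("generation does not fit in " + bitsPerValue + " bits");
        }
        long bitPos = (long) doc * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        // Clear the slot, then write the low bits of the value into this block.
        blocks[block] = (blocks[block] & ~(mask << shift)) | ((long) generation << shift);
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            // The value straddles two blocks: write its high bits into the next block.
            int lowBits = bitsPerValue - spill;
            long highMask = (1L << spill) - 1;
            blocks[block + 1] = (blocks[block + 1] & ~highMask) | ((long) generation >>> lowBits);
        }
    }

    int get(int doc) {
        long bitPos = (long) doc * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            value |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return (int) (value & mask);
    }
}
```

For the example above, a segment with a single update generation needs one bit per document:
after `set(2, 1)`, `get(2)` yields 1 and `get(1)` yields 0. The cost is one entry per document
whether or not it was ever updated, which is where the sparse case mentioned above comes in.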

                
> Incremental Field Updates through Stacked Segments
> --------------------------------------------------
>
>                 Key: LUCENE-4258
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4258
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Sivan Yogev
>   Original Estimate: 2,520h
>  Remaining Estimate: 2,520h
>
> Shai and I would like to start working on the proposal to Incremental Field Updates outlined
here (http://markmail.org/message/zhrdxxpfk6qvdaex).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
