lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4258) Incremental Field Updates through Stacked Segments
Date Thu, 20 Dec 2012 12:47:12 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537000#comment-13537000 ]

Michael McCandless commented on LUCENE-4258:
--------------------------------------------


{quote}
After rethinking the point-of-inversion issue, it seems like the right time to do it is ASAP
- not to hold the added fields and invert them later, but rather to invert them immediately and
save their inverted version. Three reasons for that:
1. It removes the constraint I added to the API, so update fields can be reused and can contain
a Reader/TokenStream.
2. NRT support: we cannot search until we invert, and if we invert earlier NRT support will
be less complicated, probably some variation on a multi-reader to view uncommitted updates.
3. You are correct that we currently do not account for the RAM usage of the FieldsUpdate,
since I thought using RAMUsageEstimator would be too costly. It will probably be more efficient
to calculate RAM usage of the inverted fields, maybe even during inversion?
{quote}

+1

I would also add "4. Inversion of updates is single-threaded", ie once
we move inversion into .updateFields it will be multi-threaded again.
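On point 3, here is a minimal sketch of what accounting-during-inversion could
look like (the class and method names are hypothetical, not existing Lucene
APIs): bump a shared counter as the inverter allocates postings storage, the
way the indexing chain tracks bytesUsed for buffered docs, instead of running
RAMUsageEstimator over each FieldsUpdate after the fact:

{code}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: account for RAM while inverting updates, instead
// of estimating it afterwards with RAMUsageEstimator.
class UpdatesRamTracker {
  private final AtomicLong bytesUsed = new AtomicLong();

  // The inverter would call this each time it allocates postings storage.
  void addBytesUsed(long numBytes) {
    bytesUsed.addAndGet(numBytes);
  }

  // IndexWriter could consult this when deciding to flush buffered updates.
  boolean ramTooLarge(long ramBufferSizeBytes) {
    return bytesUsed.get() >= ramBufferSizeBytes;
  }
}
{code}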

bq. So my question in that regard is how can I invert a document and hold its inverted form
to be used by NRT and later inserted into stacked segment? Should I create a temporary Directory
and invert into it? Is there another way to do this?

I think we should somehow re-use the existing code that inverts (eg
FreqProxTermsWriter)?  Ie, invert into an in-RAM segment, with
"temporary" docIDs, and then when it's time to apply the updates, you
need to rewrite the postings to disk with the re-mapped docIDs.
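To make the remap step concrete, a toy sketch (plain collections rather than
FreqProxTermsWriter's actual byte-level structures; BufferedUpdates and docMap
are illustrative names): updates are inverted into postings keyed by temporary
docIDs, and when the updates are applied, each temporary docID is rewritten to
the target document's docID in the base segment:

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of in-RAM inverted updates: term -> temporary docIDs.
class BufferedUpdates {
  final Map<String, List<Integer>> postings = new TreeMap<String, List<Integer>>();

  void addPosting(String term, int tempDocID) {
    List<Integer> docs = postings.get(term);
    if (docs == null) {
      docs = new ArrayList<Integer>();
      postings.put(term, docs);
    }
    docs.add(tempDocID);
  }

  // docMap[tempDocID] = docID of the updated document in the base segment;
  // the result is what would be written to disk as the stacked segment.
  Map<String, List<Integer>> remap(int[] docMap) {
    Map<String, List<Integer>> out = new TreeMap<String, List<Integer>>();
    for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
      List<Integer> remapped = new ArrayList<Integer>();
      for (int tempDocID : e.getValue()) {
        remapped.add(docMap[tempDocID]);
      }
      out.put(e.getKey(), remapped);
    }
    return out;
  }
}
{code}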

I wouldn't do anything special for NRT for starters, meaning, from
NRT's standpoint, it opens these stacked segments from disk as it
would if a new non-NRT reader was being opened.  So I would leave that
TODO in SegmentReader as a TODO for now :)  Later, we can optimize
this and have updates carried in RAM like we do for deletes, but I
wouldn't start with that ...

{quote}
bq. Merging is very important. Hmm, are we able to just merge all updates down to a single
update? Ie, without merging the base segment? We can't express that today from MergePolicy
right? In an NRT setting this seems very important (ie it'd be best bang (= improved search
performance) for the buck (= merge cost)).

Shai is helping create a benchmark to test performance in various scenarios. I will
start adding updates aspects to the merge policy. I am not sure merging just the updates of
a segment is feasible. In what cases would it be better than collapsing all updates into the
base segment?
{quote}

Imagine a huge segment that's accumulating updates ... say it has 20
stacked segments.  First off, those stacked segments are each tying up
N file descriptors on open, right?  (Well, only one if it's CFS).  But
second off, I would expect search perf with 1 base + 20 stacked to be
worse than 1 base + 1 stacked?  We need to test if that's true
... it's likely that the biggest perf loss is going from no stacked
segments to 1 stacked segment ... and then going from 1 to 20 stacked
segments doesn't hurt "that much".  We have to test and see.

Simply merging that big base segment with its 20 stacked segments is
going to be too costly to do very often.
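Just to illustrate the cost asymmetry (this is a made-up cost model, and
MergePolicy has no such hook today, which is exactly the gap): an updates-only
merge rewrites only the stacked segments, so its cost scales with the updates'
size, while a full merge must also rewrite the base:

{code}
// Purely hypothetical sketch of an "updates-only merge" decision.
class StackedMergeSketch {
  // Cost of an updates-only merge ~ total size of the stacked segments;
  // cost of a full merge ~ base + stacked.  With a huge base, collapsing
  // 20 stacked segments into 1 is far cheaper than rewriting everything.
  static boolean preferUpdatesOnlyMerge(long baseBytes, long stackedBytes, int numStacked) {
    return numStacked > 1 && stackedBytes < baseBytes / 10; // arbitrary threshold
  }
}
{code}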

{quote}
bq. I think we need a test that indexes a known (randomly generated) set of documents, randomly
sometimes using add and sometimes using update/replace field, mixing in deletes (just like
TestField.addDocuments()), for the first index, and for the second index only using addDocument
on the "surviving" documents, and then we assertIndexEquals(...) in the end? Maybe we can
factor out code from TestDuelingCodecs or TestStressIndexing2.

TestFieldReplacements already had a test which randomly adds documents, replaces documents,
adds fields and replaces fields. I refactored it to enable using a seed, and created a "clean"
version with only addDocument(...) calls. However, the FieldInfos of the "clean" version do
not include things that the "full" version includes, because in the full version fields with
certain traits were added and then deleted. I will look at the other suggestions.
{quote}

It should be fine if the FieldInfos don't match?  Ie, when comparing
the two indices we should not compare field numbers?  We should be
comparing only external things like the field name, which ids we had
indexed, etc.
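
A minimal sketch of comparing only external state (assuming each document
carries a stored "id" field; this helper is illustrative, not the patch's
assertIndexEquals):

{code}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.util.Bits;

class ExternalStateAssert {
  // Collect the stored "id" of every live document; comparing these sets
  // checks which documents survived, without looking at field numbers.
  static Set<String> liveIds(IndexReader reader) throws IOException {
    Set<String> ids = new HashSet<String>();
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    for (int docID = 0; docID < reader.maxDoc(); docID++) {
      if (liveDocs == null || liveDocs.get(docID)) {
        ids.add(reader.document(docID).get("id"));
      }
    }
    return ids;
  }
}
{code}

Then assert liveIds(expected).equals(liveIds(actual)), and do similar
per-field-name comparisons for terms/postings rather than walking FieldInfos.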

                
> Incremental Field Updates through Stacked Segments
> --------------------------------------------------
>
>                 Key: LUCENE-4258
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4258
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Sivan Yogev
>         Attachments: IncrementalFieldUpdates.odp, LUCENE-4258-API-changes.patch, LUCENE-4258.r1410593.patch,
LUCENE-4258.r1412262.patch, LUCENE-4258.r1416438.patch, LUCENE-4258.r1416617.patch, LUCENE-4258.r1422495.patch,
LUCENE-4258.r1423010.patch
>
>   Original Estimate: 2,520h
>  Remaining Estimate: 2,520h
>
> Shai and I would like to start working on the proposal to Incremental Field Updates outlined
here (http://markmail.org/message/zhrdxxpfk6qvdaex).

