lucene-dev mailing list archives

From "Shai Erera (JIRA)" <>
Subject [jira] Commented: (LUCENE-1585) Allow to control how payloads are merged
Date Thu, 06 May 2010 21:24:48 GMT


Shai Erera commented on LUCENE-1585:

I went over the tests and realized I didn't write one which adds indexes into an already populated
index. Ideally, the payloads in the existing index should not be re-processed b/c of the external
ones that are added. But this doesn't happen, as addIndexes and addIndexesNoOpt don't distinguish
well between local and external segments. It all boils down to IW.merge(), which calls SM.merge().
Then I figured a single PayloadConsumer "might not fit all" - e.g. there are cases where different
PCs are needed for different indexes. The app can call addIndexes one at a time, but that's
not efficient. So I think the entry-level API should be a PayloadConsumerProvider, which declares
one getPayloadConsumer(Directory) method. It returns a PC corresponding to a Directory. It
gives the app the freedom it needs to:
* Always return the same PC for all Dirs.
* Return different PCs for different Dirs.
* Return null for some Dirs, so that their payloads are not re-processed.
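
A minimal sketch of what such a provider could look like, covering the three options above. Everything except the names PayloadConsumerProvider and getPayloadConsumer(Directory) is an assumption made for illustration: Directory here is a tiny stand-in rather than the real Lucene class, and the shape of PayloadConsumer (a single process(byte[]) method) and the helper methods same() and perDirectory() are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: Directory and PayloadConsumer are hypothetical stand-ins,
// not the real Lucene classes.
class PayloadProviderSketch {

    /** Stand-in for a Lucene Directory, identified by a name. */
    static final class Directory {
        final String name;
        Directory(String name) { this.name = name; }
    }

    /** Hypothetical consumer: rewrites one payload during a merge. */
    interface PayloadConsumer {
        byte[] process(byte[] payload);
    }

    /** The proposed entry point: one consumer per Directory, or null. */
    interface PayloadConsumerProvider {
        PayloadConsumer getPayloadConsumer(Directory dir);
    }

    /** Option 1: the same consumer for every Directory. */
    static PayloadConsumerProvider same(PayloadConsumer pc) {
        return dir -> pc;
    }

    /** Options 2 and 3: per-Directory consumers; unmapped Directories get null. */
    static PayloadConsumerProvider perDirectory(Map<Directory, PayloadConsumer> byDir) {
        return byDir::get;
    }

    public static void main(String[] args) {
        Directory local = new Directory("local");
        Directory external = new Directory("external");

        // Example consumer: prepend a one-byte format version to each payload.
        PayloadConsumer addVersion = p -> {
            byte[] out = new byte[p.length + 1];
            out[0] = 2;
            System.arraycopy(p, 0, out, 1, p.length);
            return out;
        };

        Map<Directory, PayloadConsumer> byDir = new HashMap<>();
        byDir.put(external, addVersion);
        PayloadConsumerProvider provider = perDirectory(byDir);

        // The external Directory gets a consumer; the local one gets null,
        // so its payloads would be left untouched.
        System.out.println(provider.getPayloadConsumer(external) != null);
        System.out.println(provider.getPayloadConsumer(local) == null);
    }
}
```

Returning null rather than a no-op consumer lets the merge code skip payload processing entirely for that Directory.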

Setting out to impl that, I've noticed addIndexes and addIndexesNoOpt behave differently.
While addIndexes interacts w/ the SegmentMerger directly (and hence can easily pass it the
PCP), NoOpt reads the SIs from the given Dirs and calls maybeMerge(), which triggers SM.merge()
to merge local + external segments. We cannot pass PCP to maybeMerge since that won't help:
the call chain hits MergeScheduler, which loops back to us when it calls IW.merge(). That seems
way too complicated.
Additionally, there is no way to guarantee that PCP won't be invoked during addIndexesNoOpt
on local segments (unless it does not provide a PC for the target Dir) ...

Therefore, I'd like to add PCP to IWC, for the following reasons:
* As I said above, there's no way to guarantee it won't be invoked on local segments when
addIndexesNoOpt is called.
* There's no clean way to ensure NoOpt passes it on to SM, w/o passing PCP through MergeScheduler.
* It might be useful for apps that want to rewrite their payloads only over time -- sort of
a mini app-level migration tool (of just payloads).
* It cleans the API - does not affect backwards compatibility, no need to pass it on through several methods
until it gets to SM -- simplifies the solution.
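
To illustrate the last point, here is a rough sketch of a merge step that picks the provider up from the writer configuration instead of receiving it through the call chain. All types are simplified stand-ins, not Lucene classes: WriterConfig plays the role of IWC, mergePayloads() the role of SM.merge(), and the setter name setPayloadConsumerProvider is a hypothetical choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: every type below is a simplified stand-in, not a Lucene class.
class ConfigProviderSketch {

    static final class Directory {
        final String name;
        Directory(String name) { this.name = name; }
    }

    interface PayloadConsumer {
        byte[] process(byte[] payload);
    }

    interface PayloadConsumerProvider {
        PayloadConsumer getPayloadConsumer(Directory dir);
    }

    /** Stand-in for a segment: the Directory it lives in plus its payloads. */
    static final class Segment {
        final Directory dir;
        final List<byte[]> payloads;
        Segment(Directory dir, byte[]... payloads) {
            this.dir = dir;
            this.payloads = Arrays.asList(payloads);
        }
    }

    /** Stand-in for IWC: holds the provider (hypothetical setter name). */
    static final class WriterConfig {
        private PayloadConsumerProvider provider; // null means: never re-process
        WriterConfig setPayloadConsumerProvider(PayloadConsumerProvider p) {
            this.provider = p;
            return this;
        }
        PayloadConsumerProvider getPayloadConsumerProvider() { return provider; }
    }

    /** Stand-in for the merge: asks the config for the provider, nothing is passed in. */
    static List<byte[]> mergePayloads(WriterConfig config, List<Segment> segments) {
        PayloadConsumerProvider provider = config.getPayloadConsumerProvider();
        List<byte[]> merged = new ArrayList<>();
        for (Segment seg : segments) {
            // A null consumer means this Directory's payloads are copied as-is.
            PayloadConsumer pc = provider == null ? null : provider.getPayloadConsumer(seg.dir);
            for (byte[] payload : seg.payloads) {
                merged.add(pc == null ? payload : pc.process(payload));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Directory target = new Directory("target");
        Directory external = new Directory("external");

        // Only payloads coming from the external Directory are rewritten.
        WriterConfig config = new WriterConfig().setPayloadConsumerProvider(
            dir -> dir == external ? p -> new byte[] { (byte) (p[0] + 1) } : null);

        List<byte[]> merged = mergePayloads(config, Arrays.asList(
            new Segment(target, new byte[] { 1 }),
            new Segment(external, new byte[] { 1 })));

        for (byte[] payload : merged) {
            System.out.println(Arrays.toString(payload));
        }
    }
}
```

Because the merge code reads the provider from the configuration, nothing in between (maybeMerge, the scheduler) needs a new parameter, and the app keeps local payloads untouched simply by returning null for the target Directory.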

This is an expert API, so apps that set it probably know what they're doing. Therefore,
I believe they will be able to understand how to not invoke their PCs on the target dir's

What do you think?

> Allow to control how payloads are merged
> ----------------------------------------
>                 Key: LUCENE-1585
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Shai Erera
>            Priority: Minor
>             Fix For: 3.1, 4.0
>         Attachments: LUCENE-1585_3x.patch, LUCENE-1585_trunk.patch
> Lucene handles backwards-compatibility of its data structures by
> converting them from the old into the new formats during segment
> merging. 
> Payloads are simply byte arrays in which users can store arbitrary
> data. Applications that use payloads might want to convert the format
> of their payloads in a similar fashion. Otherwise it's not easily
> possible to ever change the encoding of a payload without reindexing.
> So I propose to introduce a PayloadMerger class that the SegmentMerger
> invokes to merge the payloads from multiple segments. Users can then
> implement their own PayloadMerger to convert payloads from an old into
> a new format.
> In the future we need this kind of flexibility also for column-stride
> fields (LUCENE-1231) and flexible indexing codecs.
> In addition to that it would be nice if users could store version
> information in the segments file. E.g. they could store "in segment _2
> the term a:b uses payloads of format x.y".

