lucene-dev mailing list archives

From Shai Erera <ser...@gmail.com>
Subject PayloadProcessorProvider Usage
Date Wed, 13 Apr 2011 17:43:37 GMT
Hey,

In Lucene 3.1 we introduced PayloadProcessorProvider (PPP), which allows you
to rewrite the payloads of terms during a merge. The main scenario is merging
indexes where you want to rewrite/remap the payloads of the incoming indexes,
but one can certainly also use it to rewrite the payloads of a term within a
single index.
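
To make the rewriting step concrete, here is a minimal sketch in plain Java.
It does not depend on Lucene. The class name, the old-to-new map and the
assumption that each payload holds a single 4-byte big-endian ordinal are all
hypothetical and chosen only for illustration. The method takes the same
(byte[], int, int) arguments as the per-term payload hook, as far as I recall
its shape, so similar logic could live inside a real processor.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for the payload rewriting step; this is NOT a Lucene class. */
final class OrdinalPayloadRemapper {
    // Hypothetical mapping from ordinals in the source index to ordinals in the target.
    private final Map<Integer, Integer> oldToNew;

    OrdinalPayloadRemapper(Map<Integer, Integer> oldToNew) {
        this.oldToNew = new HashMap<Integer, Integer>(oldToNew);
    }

    /**
     * Assumes payload[start .. start+4) holds one big-endian int.
     * Returns a fresh 4-byte payload carrying the remapped value.
     */
    byte[] processPayload(byte[] payload, int start, int length) {
        if (length != 4) {
            throw new IllegalArgumentException("expected a 4-byte payload, got " + length);
        }
        int old = ((payload[start] & 0xFF) << 24)
                | ((payload[start + 1] & 0xFF) << 16)
                | ((payload[start + 2] & 0xFF) << 8)
                | (payload[start + 3] & 0xFF);
        Integer mapped = oldToNew.get(old);
        int value = (mapped != null) ? mapped : old; // unmapped ordinals pass through unchanged
        return new byte[] {
            (byte) (value >>> 24), (byte) (value >>> 16), (byte) (value >>> 8), (byte) value
        };
    }
}
```

With a map of 7 -> 42, a payload carrying 7 comes back carrying 42, and any
ordinal missing from the map is returned unchanged.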

When we worked on it, we thought of two ways a user can rewrite payloads
when merging indexes:

1) Set the PPP on the target IW and call addIndexes(IndexReader); the PPP is
then applied to the incoming directories only.
2) Set the PPP on the source IW, call IW.optimize(), then call
targetIW.addIndexes(Directory).

The latter is better. In both cases the incoming segments are rewritten
anyway, but in the first case you might also trigger merges of segments in
the target index, something you might want to avoid (that was the purpose of
optimizing addIndexes(Directory)).

But it turns out the latter is not so easy to achieve. If the source index
has only 1 segment (in my case, ~100% of the time), then calling optimize()
doesn't do anything, because the MergePolicy (MP) thinks the index is already
optimized and returns no MergeSpecification. To overcome this, I wrote a
ForceOptimizeMP, which extends LogMergePolicy and forces an optimize even if
there is only one segment.
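
The difference between the two behaviours can be shown with a toy model in
plain Java. These are hypothetical classes, not Lucene's, and segments are
just names here. The real check also looks at other things, such as the
compound-file state, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of the optimize decision; hypothetical, NOT Lucene's MergePolicy. */
final class OptimizeDecisionSketch {

    /** Default-like behaviour: no merges if the index already has <= maxNumSegments segments. */
    static List<List<String>> defaultPolicy(List<String> segments, int maxNumSegments) {
        if (segments.size() <= maxNumSegments) {
            return Collections.emptyList(); // treated as "already optimized"
        }
        return Collections.<List<String>>singletonList(new ArrayList<String>(segments));
    }

    /** Forcing variant: one merge covering every segment, even a lone one. */
    static List<List<String>> forcePolicy(List<String> segments) {
        if (segments.isEmpty()) {
            return Collections.emptyList(); // nothing to rewrite in an empty index
        }
        return Collections.<List<String>>singletonList(new ArrayList<String>(segments));
    }
}
```

For a single-segment index, defaultPolicy returns no merges, so nothing is
rewritten and the PPP never runs. forcePolicy returns one merge containing
that segment, which is what gives the PPP a chance to rewrite the payloads.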

Another option is to set noCFSRatio to 1.0 and flip the useCompoundFile flag
(i.e. if the source is compound, create non-compound, and vice versa). That
can work too, but I don't think it's very good, because the source index will
be changed from compound to non-compound (or vice versa), which is something
the app didn't ask for.

So I think the first workaround (ForceOptimizeMP) is the better of the two,
but I wanted to ask if someone knows of a better way to achieve this?

Shai
