lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4557) Indexed Offsets Can Be Lost During Merge
Date Wed, 14 Nov 2012 20:54:13 GMT


Robert Muir commented on LUCENE-4557:

> i also disagree with calling this "fake" data
> the data would be 100% representative of what was indexed

Well, it's fake in the sense that this segment now advertises that it was indexed with things
like positions, but in fact it has no positions: it's essentially corrupt.

So for example, if you populate bogus frequencies, positions, offsets, and so on, and also
have term vectors, then later when you run CheckIndex it will report the segment as corrupt
because these values disagree.

This is because CheckIndex (for good reasons) exploits any possible redundancies in the index
to detect if data is wrong.
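
To illustrate the kind of cross-check meant here, below is a toy sketch in plain Java. It is not CheckIndex's code and uses no Lucene classes: the class name, the method firstMismatch, and the map-based "postings" and "vectors" views are all assumptions made up for the example. It only shows why bogus frequencies get caught once a second, honest copy of the same data exists.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only: these types are illustrative and are not Lucene classes.
final class RedundancyCheckSketch {

    /**
     * Compares per-term frequencies for one document as seen through two
     * redundant views. Checks every term recorded in the vectors and returns
     * a description of the first disagreement, or null when they all agree.
     */
    static String firstMismatch(Map<String, Integer> postingsFreqs,
                                Map<String, Integer> vectorFreqs) {
        // Sort the terms so the report is deterministic.
        Map<String, Integer> sorted = new TreeMap<>(vectorFreqs);
        for (Map.Entry<String, Integer> e : sorted.entrySet()) {
            Integer fromPostings = postingsFreqs.get(e.getKey());
            if (!e.getValue().equals(fromPostings)) {
                return "term '" + e.getKey() + "': vectors say " + e.getValue()
                        + ", postings say " + fromPostings;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Integer> vectors = Map.of("lucene", 2, "merge", 1);
        Map<String, Integer> honest = Map.of("lucene", 2, "merge", 1);
        Map<String, Integer> bogus = Map.of("lucene", 1, "merge", 1);

        // Honest postings agree with the vectors; bogus ones do not.
        System.out.println(firstMismatch(honest, vectors));
        System.out.println(firstMismatch(bogus, vectors));
    }
}
```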

> if this is a more palatable approach for you, i can work up a patch as i find time

This is definitely more palatable, though it's sneaky that it wouldn't be called if you did addIndexes(IR).
So maybe the proposed method should take AtomicReader...

> This wrapping would just need to be smart (a good MergeSegmentReader base class that SegmentMerger
> is integrated with) in order to optimize bulk merges of stored fields/termvectors/etc

I don't think this is a good idea: I think in your case you would just return the original
reader unless you needed to 'migrate something'. This way you get the bulk merging optimizations
when it's safe, but not when it's unsafe.

For example, if you are lying about positions or offsets, then you need to ensure the vectors
are consistent too (or drop them). It's not safe to bulk merge them.
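
A rough sketch of that rule in plain Java follows. Everything in it is hypothetical: MergeReader, wrapForMerge and canBulkMerge are names invented for this example and are not Lucene API, and a real reader carries far more state than two booleans. The point is only the shape of the decision: hand back the same reader when nothing needs migrating, and treat any other return value as a signal that bulk copying is off.

```java
// Toy model only: MergeReader and wrapForMerge are hypothetical names,
// not part of Lucene's API.
final class MergeWrapSketch {

    /** Minimal stand-in for a segment reader. */
    static final class MergeReader {
        final boolean hasOffsets;
        final boolean hasVectors;

        MergeReader(boolean hasOffsets, boolean hasVectors) {
            this.hasOffsets = hasOffsets;
            this.hasVectors = hasVectors;
        }
    }

    /**
     * Returns the original reader untouched unless offsets must be migrated.
     * A migrated reader claims the target setting and drops its term vectors,
     * since they would no longer agree with the rewritten postings.
     */
    static MergeReader wrapForMerge(MergeReader original, boolean targetHasOffsets) {
        if (original.hasOffsets == targetHasOffsets) {
            return original; // nothing to migrate
        }
        return new MergeReader(targetHasOffsets, false);
    }

    /** Bulk copying is only allowed when the hook returned the same instance. */
    static boolean canBulkMerge(MergeReader original, MergeReader wrapped) {
        return original == wrapped;
    }

    public static void main(String[] args) {
        MergeReader current = new MergeReader(true, true);
        MergeReader legacy = new MergeReader(false, true);

        System.out.println(canBulkMerge(current, wrapForMerge(current, true)));
        System.out.println(canBulkMerge(legacy, wrapForMerge(legacy, true)));
    }
}
```

Dropping the vectors is the simpler of the two options mentioned above; rebuilding them to match the migrated postings would be the other.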

> Indexed Offsets Can Be Lost During Merge
> ----------------------------------------
>                 Key: LUCENE-4557
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.0
>            Reporter: Tim Smith
>         Attachments:
> Primary Use case:
> Start with pre-4.0 index (no indexed offsets available)
> Start indexing new documents with indexed offsets (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, previously was IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
> merge/optimize index
> newly indexed documents will now no longer have offsets available
> In general, it is impossible to ever change a field to have offsets indexed when starting with an existing index as a merge will cause offsets to be removed from the index.
> Desirable behavior would be for new documents to have offsets indexed properly, and old documents would have offset of "0, 0" for all positions after merging with a segment that contains offsets
> Current behavior can be very dangerous.
> for example:
> * Start indexing documents with indexed offsets
> * change config to not index offsets by accident
> * index 1 document
> * revert config back
> * offsets will start disappearing from documents as segments are merged
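
The behavior reported in the issue above can be modeled with a small plain-Java sketch. It assumes the merged field keeps only the level of detail that every input segment has, which is what the description implies. The enum here is a local stand-in that borrows the constant names from the report, and mergedOptions is a hypothetical helper, not a Lucene method.

```java
import java.util.List;

// Toy model only: a local stand-in enum, not Lucene's IndexOptions class.
final class OffsetsDowngradeSketch {

    /** Declared from least to most detail, so compareTo orders by detail. */
    enum IndexOptions {
        DOCS_ONLY,
        DOCS_AND_FREQS,
        DOCS_AND_FREQS_AND_POSITIONS,
        DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
    }

    /** Returns the least detailed option among the segments being merged. */
    static IndexOptions mergedOptions(List<IndexOptions> segments) {
        IndexOptions result = segments.get(0);
        for (IndexOptions options : segments) {
            if (options.compareTo(result) < 0) {
                result = options; // downgrade to the lowest common level
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Many segments with offsets, plus one written with the wrong config.
        List<IndexOptions> segments = List.of(
                IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS,
                IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS,
                IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);

        // The single offset-less segment decides the outcome of the merge.
        System.out.println(mergedOptions(segments));
    }
}
```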
