lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] [Updated] (LUCENE-4127) negative offsets/deltas corruption
Date Sun, 10 Jun 2012 14:20:42 GMT


Michael McCandless updated LUCENE-4127:

    Attachment: LUCENE-4127.patch

I think we should also strictly check posIncr coming into IndexWriter. The attached patch does
that and fixes a couple of tests that were sending posIncr=0 for the first token.
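
For illustration only, here is a minimal sketch of what such a posIncr check could look like.
It is not the code from LUCENE-4127.patch: the class PosIncrCheck, the method nextPosition and
the convention that lastPosition is -1 before the first token are all assumptions made for this
example.

```java
/** Hypothetical helper for illustration; not taken from LUCENE-4127.patch. */
final class PosIncrCheck {
    private PosIncrCheck() {}

    /**
     * Validates a position increment and returns the resulting token position.
     * Assumption: lastPosition is -1 before the first token of a field.
     */
    static int nextPosition(int lastPosition, int posIncr) {
        if (posIncr < 0) {
            throw new IllegalArgumentException(
                "position increment must be >= 0, got " + posIncr);
        }
        if (lastPosition == -1 && posIncr == 0) {
            // A first token with posIncr=0 would land on position -1.
            throw new IllegalArgumentException(
                "first position increment must be > 0");
        }
        // addExact throws ArithmeticException instead of silently overflowing.
        return Math.addExact(lastPosition, posIncr);
    }
}
```

With this convention a first token with posIncr=1 gets position 0, and a later token with
posIncr=0 stays on the previous position (stacked tokens such as synonyms).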
> negative offsets/deltas corruption
> ----------------------------------
>                 Key: LUCENE-4127
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-4127.patch, LUCENE-4127_test.patch
> If offsets go negative or backwards, it can corrupt the index with DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS: the offsets will have wrong values (different from the term vectors) or even crazy values like -2147483645.
> The problem with this is that it's not just theoretical: it's too easy to do this with Lucene's own analyzer chains (e.g. NGramTokenizer).
> See issues such as LUCENE-3920 and some discussion on LUCENE-3738
> The question is how to fix this, e.g. should we:
> # start enforcing that offsets cannot be crazy values in OffsetAttributeImpl/IndexWriter
and fix the broken analyzers
> # leave offsets as a pair of opaque integers, declaring this a limitation of the current codec, and either work around it or throw UOE from the postings writer.
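
As a rough illustration of the first option, the sketch below rejects offsets that are negative,
inverted or going backwards before they reach the postings writer. It is a sketch under stated
assumptions, not the fix adopted for this issue: the class OffsetCheck and the method
checkOffsets are hypothetical names, and it assumes the caller tracks the previous token's start
offset per field.

```java
/** Hypothetical helper for illustration; not the actual fix for LUCENE-4127. */
final class OffsetCheck {
    private OffsetCheck() {}

    /**
     * Validates one token's offsets against the previous token's start offset.
     * Returns startOffset so the caller can keep it as the new lastStartOffset.
     * Assumption: lastStartOffset is 0 before the first token of a field.
     */
    static int checkOffsets(int lastStartOffset, int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException(
                "startOffset must be non-negative and endOffset must be >= startOffset, "
                    + "got startOffset=" + startOffset + ", endOffset=" + endOffset);
        }
        if (startOffset < lastStartOffset) {
            // If start offsets are stored as deltas from the previous token,
            // a backwards step would produce a negative delta.
            throw new IllegalArgumentException(
                "offsets must not go backwards: startOffset=" + startOffset
                    + " is < lastStartOffset=" + lastStartOffset);
        }
        return startOffset;
    }
}
```

An analyzer chain that emits start offsets 0, 2, 1 would fail on the third token here, which is
the kind of input the issue says is too easy to produce today.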
