lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4221) CheckIndex is overeager for term vector offsets bounds checks
Date Sun, 15 Jul 2012 02:54:34 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414566#comment-13414566 ]

Robert Muir commented on LUCENE-4221:
-------------------------------------

This patch disables all the offsets checks for term vectors.

I'd like a plan to start enforcing this in IndexWriter for term vectors as well, so we
can actually turn these checks on at some point in the future. Sure, maybe it's annoying that
things like ngrams violate all these rules and will fail if term vectors are on, but these
are broken analyzers that need to be fixed, and we shouldn't allow bogus data in the index.

The problem with the current situation (besides CheckIndex) is that if someone has such bogus
offsets in an older index and tries to use something like Highlighter, they will just trip
errors from OffsetAttribute, etc. So they won't really work.

Best idea I have so far:
# Fix LUCENE-4180 so that we can differentiate between 4.0-alpha and 4.0-beta indexes.
# Change the default term vectors merge impl to buffer one doc in RAM; if it has invalid
offsets, clear the offsets bit and don't write them.
# Only enable bulk merge for the 4.x codec when the segment was written by 4.0-beta+; otherwise
just call super.merge.
One downside is that we must keep the one-doc buffering (part 2) even in trunk until 6.x to
support 4.0-alpha indexes, but it's too late now.
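
To make part 2 concrete, here is a rough sketch of the buffer-then-write idea. It is
not the actual merge code and does not use Lucene's TermVectorsWriter API:
VectorOccurrence, BufferedDoc and OneDocBufferingMerge are hypothetical stand-ins, and
the validity rule (no negative offsets, no startOffset > endOffset) is an assumption
made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; none of these are Lucene classes.
final class VectorOccurrence {
    final String term;
    final int startOffset;
    final int endOffset;

    VectorOccurrence(String term, int startOffset, int endOffset) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

final class BufferedDoc {
    // The whole document's vectors are held in RAM before anything is written.
    final List<VectorOccurrence> occurrences = new ArrayList<>();
    // Stands in for the "offsets bit" of this document's term vectors.
    boolean hasOffsets = true;
}

final class OneDocBufferingMerge {
    // Assumed rule: offsets are bogus if negative or if startOffset > endOffset.
    static boolean offsetsValid(BufferedDoc doc) {
        for (VectorOccurrence o : doc.occurrences) {
            if (o.startOffset < 0 || o.endOffset < o.startOffset) {
                return false;
            }
        }
        return true;
    }

    // Inspect the buffered doc first; if any offset is bogus, clear the
    // offsets bit and write the terms without offsets.
    static List<String> mergeDoc(BufferedDoc doc) {
        if (doc.hasOffsets && !offsetsValid(doc)) {
            doc.hasOffsets = false;
        }
        List<String> written = new ArrayList<>();
        for (VectorOccurrence o : doc.occurrences) {
            written.add(doc.hasOffsets
                ? o.term + "[" + o.startOffset + "," + o.endOffset + "]"
                : o.term);
        }
        return written;
    }
}
```

The point of buffering the whole doc is that the decision to drop offsets has to be
made before the first term is written, since the offsets bit applies to the vectors as
a whole rather than to a single occurrence.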

                
> CheckIndex is overeager for term vector offsets bounds checks
> -------------------------------------------------------------
>
>                 Key: LUCENE-4221
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4221
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 4.0-ALPHA
>            Reporter: Robert Muir
>             Fix For: 4.0, 5.0
>
>         Attachments: LUCENE-4221.patch
>
>
> In some situations (like running shingles twice), you end up with a case where
> startOffset > endOffset.
> We prevent this in IndexWriter for postings offsets, but we never do any validation here
> for term vectors (at some point, maybe we should make a plan to address this?).
> Anyway, currently CheckIndex will wrongly fail in this situation, which some of our own
> analyzers even produce (e.g. LUCENE-3920)...
> This is an overly eager validation in CheckIndex (for vectors, we cannot safely make these
> assertions, as this was/is never enforced by IndexWriter, only for postings offsets).
