lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Created: (LUCENE-2016) replace invalid U+FFFF character during indexing
Date Thu, 29 Oct 2009 17:06:59 GMT
replace invalid U+FFFF character during indexing
------------------------------------------------

                 Key: LUCENE-2016
                 URL: https://issues.apache.org/jira/browse/LUCENE-2016
             Project: Lucene - Java
          Issue Type: Bug
    Affects Versions: 2.9, 2.4.1, 2.4
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: 2.9.1, 3.0


If the invalid U+FFFF character is embedded in a token, indexing silently corrupts the index
by writing duplicate terms into the terms dictionary.  CheckIndex will catch the error, and
merging will hit exceptions (I think).

We already replace invalid surrogate pairs with the replacement character U+FFFD, so I'll
just do the same with U+FFFF.
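
For illustration only, the replacement described above could look roughly like the
sketch below. This is not the actual Lucene patch: the class name InvalidCharSanitizer
and the method sanitize are hypothetical, and the sketch works on a plain String rather
than on Lucene's internal term buffers. It maps U+FFFF and any unpaired surrogate to
U+FFFD and leaves valid surrogate pairs untouched.

```java
/**
 * Hypothetical helper, not part of Lucene: replaces U+FFFF and unpaired
 * surrogates in a term with the replacement character U+FFFD.
 */
public final class InvalidCharSanitizer {
    private static final char REPLACEMENT = (char) 0xFFFD;
    private static final char INVALID = (char) 0xFFFF;

    private InvalidCharSanitizer() {}

    public static String sanitize(String term) {
        StringBuilder out = new StringBuilder(term.length());
        int i = 0;
        while (i < term.length()) {
            char c = term.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate is valid only when a low surrogate follows it.
                if (i + 1 < term.length() && Character.isLowSurrogate(term.charAt(i + 1))) {
                    out.append(c).append(term.charAt(i + 1));
                    i += 2;
                    continue;
                }
                out.append(REPLACEMENT);
            } else if (Character.isLowSurrogate(c) || c == INVALID) {
                // A low surrogate with no preceding high surrogate, or U+FFFF itself.
                out.append(REPLACEMENT);
            } else {
                out.append(c);
            }
            i++;
        }
        return out.toString();
    }
}
```

With this sketch, a token such as "ab" + U+FFFF + "cd" would come back as
"ab" + U+FFFD + "cd", so the invalid character never reaches the terms dictionary.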

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

