lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing
Date Thu, 29 Oct 2009 17:42:00 GMT


Michael McCandless commented on LUCENE-2016:

Lucene has "traditionally" not enforced the rules on "not for interchange"
characters, i.e., it has just let them through.

But then with the indexing speedups (LUCENE-843), we no longer allowed
U+FFFF, and with the cutover to true UTF-8 in the index, we no longer
allowed invalid surrogate pairs.

And we know apps use these characters (because they hit problems with
U+FFFF on upgrading to 2.3).

So I think it would be too pedantic to suddenly start replacing all of
these invalid interchange chars today (though it would obviously be
more "standards compliant").  Plus, it would cost us nontrivial
indexing CPU to do so!
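
For reference, the "not for interchange" characters are the 66 Unicode noncharacters: U+FDD0..U+FDEF, plus the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, and so on). A minimal sketch of what a per-term check could look like follows. The class and method names are hypothetical and are not part of Lucene; the point is only to show the extra per-code-point pass that such enforcement would add.

```java
// Hypothetical helper, not Lucene code: detects Unicode noncharacters.
final class NoncharacterCheck {

    /** Returns true if cp is one of the 66 Unicode noncharacters. */
    static boolean isNoncharacter(int cp) {
        // Range U+FDD0..U+FDEF, or a code point whose low 16 bits are FFFE or FFFF.
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Counts noncharacters in s, walking by code point rather than by char. */
    static int countNoncharacters(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isNoncharacter(cp)) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }
}
```

Calling `NoncharacterCheck.countNoncharacters(term)` on every token would add one more scan of the term text during indexing.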

> replace invalid U+FFFF character during indexing
> ------------------------------------------------
>                 Key: LUCENE-2016
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.4, 2.4.1, 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 2.9.1, 3.0
>         Attachments: LUCENE-2016.patch
>
> If the invalid U+FFFF character is embedded in a token, it actually causes indexing to
> silently corrupt the index by writing duplicate terms into the terms dict.  CheckIndex will
> catch the error, and merging will hit exceptions (I think).
> We already replace invalid surrogate pairs with the replacement character U+FFFD, so
> I'll just do the same with U+FFFF.
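
The replacement described in the issue can be sketched roughly as below. This is an illustration only, not the contents of LUCENE-2016.patch: the class `TermSanitizer` and its `sanitize` method are hypothetical names, and the sketch works on a `String` for clarity, whereas real indexing code would more likely do this in place while encoding the term.

```java
// Hypothetical sketch, not the actual Lucene patch.
final class TermSanitizer {
    private static final char INVALID = (char) 0xFFFF;      // U+FFFF
    private static final char REPLACEMENT = (char) 0xFFFD;  // U+FFFD

    /** Replaces U+FFFF and unpaired surrogates in term with U+FFFD. */
    static String sanitize(String term) {
        char[] chars = term.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c == INVALID) {
                chars[i] = REPLACEMENT;
            } else if (Character.isHighSurrogate(c)) {
                if (i + 1 < chars.length && Character.isLowSurrogate(chars[i + 1])) {
                    i++; // valid surrogate pair: keep both halves, skip the low one
                } else {
                    chars[i] = REPLACEMENT; // high surrogate with no low partner
                }
            } else if (Character.isLowSurrogate(c)) {
                chars[i] = REPLACEMENT; // low surrogate with no preceding high one
            }
        }
        return new String(chars);
    }
}
```

Each invalid char is replaced one-for-one, so the term keeps its length and character offsets stay aligned with the original text.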

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
