lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2016) replace invalid U+FFFF character during indexing
Date Thu, 29 Oct 2009 18:11:59 GMT


Robert Muir commented on LUCENE-2016:

> But if we forcefully map all invalid-for-interchange unicode characters to the replacement
> character (I think that's what's being proposed, right?), then your app no longer has any
> characters it can use for its own "internal" purposes?

This is not true. If you map them to replacement characters, then my app is free to use them
"process-internally" as specified by the standard, without any concern that they will appear
in the "interchange" (Lucene index data).
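
To make the mapping concrete, here is a minimal sketch of what "map them to replacement characters" could look like at the interchange boundary. It is only an illustration, not Lucene code: the class and method names (NoncharacterMapper, isNoncharacter, mapForInterchange) are hypothetical. It treats the 66 Unicode noncharacters (U+FDD0..U+FDEF, plus the last two code points of every plane) as invalid for interchange and rewrites each one to U+FFFD.

```java
// Hypothetical helper, not part of Lucene: maps Unicode noncharacters to U+FFFD.
final class NoncharacterMapper {
    private static final int REPLACEMENT = 0xFFFD;

    /** True for U+FDD0..U+FDEF and for the last two code points of each plane (U+xxFFFE, U+xxFFFF). */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Returns a copy of s in which every noncharacter is replaced by U+FFFD. */
    static String mapForInterchange(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // reads a full supplementary code point when present
            i += Character.charCount(cp);   // advance by 1 or 2 UTF-16 units
            out.appendCodePoint(isNoncharacter(cp) ? REPLACEMENT : cp);
        }
        return out.toString();
    }
}
```

With a step like this at the boundary, the application can still use noncharacters such as U+FFFF as in-memory sentinels, because none of them survive into the data that gets written out.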

I agree with you, let's open a separate "anal unicode issue". Let's go with your U+FFFF fix
for Lucene 2.9, since it fixes Lucene Java, but correct this for 3.x in the future?

> replace invalid U+FFFF character during indexing
> ------------------------------------------------
>                 Key: LUCENE-2016
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.4, 2.4.1, 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 2.9.1, 3.0
>         Attachments: LUCENE-2016.patch
> If the invalid U+FFFF character is embedded in a token, it actually causes indexing to
> silently corrupt the index by writing duplicate terms into the terms dict.  CheckIndex will
> catch the error, and merging will hit exceptions (I think).
> We already replace invalid surrogate pairs with the replacement character U+FFFD, so
> I'll just do the same with U+FFFF.
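
The narrower fix described in the issue can be sketched as follows. This is an assumption-laden illustration, not the attached LUCENE-2016.patch: the class TokenSanitizer and its sanitize method are hypothetical names, and the real change lives inside Lucene's indexing code rather than in a standalone string pass. The sketch keeps well-formed surrogate pairs, and replaces unpaired surrogates and U+FFFF with U+FFFD.

```java
// Hypothetical illustration of the behaviour described in the issue; not the actual patch.
final class TokenSanitizer {
    /** Replaces unpaired surrogates and U+FFFF with U+FFFD; leaves everything else untouched. */
    static String sanitize(String token) {
        StringBuilder out = new StringBuilder(token.length());
        int i = 0;
        while (i < token.length()) {
            char c = token.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < token.length()
                    && Character.isLowSurrogate(token.charAt(i + 1))) {
                // Well-formed surrogate pair: copy both UTF-16 units unchanged.
                out.append(c).append(token.charAt(i + 1));
                i += 2;
            } else if (Character.isSurrogate(c) || c == '\uFFFF') {
                // Unpaired surrogate or the invalid U+FFFF: substitute U+FFFD.
                out.append('\uFFFD');
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Unlike a blanket mapping of every noncharacter, this only touches the inputs named in the issue, which matches the proposal to take the small fix now and revisit the broader question later.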
