xml-xalan-dev mailing list archives

From "Brian Minchau (JIRA)" <xalan-...@xml.apache.org>
Subject [jira] Commented: (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
Date Wed, 03 Sep 2008 20:23:44 GMT

    [ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628122#action_12628122 ]

Brian Minchau commented on XALANJ-2419:

Henri, Richard:
You are right: the code there should have checked whether ch was the high surrogate of a
high/low pair, and m_encodingInfo.isInEncoding(high,low) should have been called in that case.

Definite bug. I'll see if I can supply a patch.
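
For illustration only, the check described above might look roughly like the sketch below. It is not the patch for this issue and not Xalan code: the class SurrogateAwareEscaper, its methods, and the IntPredicate standing in for m_encodingInfo are all hypothetical, and markup escaping (<, &, and so on) is deliberately left out. The point is that a high/low pair is combined into one code point before the encoding check, so at most one NCR is written per astral character.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.function.IntPredicate;

// Hypothetical helper, not part of Xalan: surrogate-aware NCR escaping.
final class SurrogateAwareEscaper {

    // inEncoding is an assumed stand-in for the serializer's encoding check;
    // it receives a full Unicode code point, never a lone surrogate.
    static void writeEscaped(char[] chars, Writer writer, IntPredicate inEncoding)
            throws IOException {
        for (int i = 0; i < chars.length; i++) {
            char ch = chars[i];
            int codePoint = ch;
            if (Character.isHighSurrogate(ch)
                    && i + 1 < chars.length
                    && Character.isLowSurrogate(chars[i + 1])) {
                // Combine the pair into one code point and consume the low half.
                codePoint = Character.toCodePoint(ch, chars[i + 1]);
                i++;
            } else if (Character.isSurrogate(ch)) {
                // An unpaired surrogate cannot be represented in XML at all.
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            }
            if (inEncoding.test(codePoint)) {
                // Representable: write the character(s) through unchanged.
                writer.write(Character.toChars(codePoint));
            } else {
                // Not representable: one NCR carrying the code point value.
                writer.write("&#");
                writer.write(Integer.toString(codePoint));
                writer.write(';');
            }
        }
    }

    // Convenience wrapper that returns the escaped text as a String.
    static String escape(String text, IntPredicate inEncoding) {
        StringWriter out = new StringWriter();
        try {
            writeEscaped(text.toCharArray(), out, inEncoding);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

With an ASCII-only predicate (cp -> cp < 0x80), the pair \uD83D\uDE00 would come out as the single reference &#128512; rather than two surrogate NCRs; with a predicate that accepts everything, the pair passes through unchanged.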

> Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
> ---------------------------------------------------------------------------------------------
>                 Key: XALANJ-2419
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2419
>             Project: XalanJ2
>          Issue Type: Bug
>          Components: Serialization
>    Affects Versions: 2.7.1
>            Reporter: Henri Sivonen
> org.apache.xml.serializer.ToStream contains the following code:
>                     else if (m_encodingInfo.isInEncoding(ch)) {
>                         // If the character is in the encoding, and
>                         // not in the normal ASCII range, we also
>                         // just leave it get added on to the clean characters
>                     }
>                     else {
>                         // This is a fallback plan, we should never get here
>                         // but if the character wasn't previously handled
>                         // (i.e. isn't in the encoding, etc.) then what
>                         // should we do?  We choose to write out an entity
>                         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>                         writer.write("&#");
>                         writer.write(Integer.toString(ch));
>                         writer.write(';');
>                         lastDirtyCharProcessed = i;
>                     }
> This leads to the wrong (latter) if branch running for surrogates, because isInEncoding()
> for UTF-8 returns false for surrogates. It is always wrong (regardless of encoding) to escape
> a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters in it ends
> up in an ill-formed serialization and does not parse back using an XML parser.
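
The ill-formedness described in the report can be checked with the JDK's own JAXP parser. The sketch below is an illustration, not code from the issue: the class name SurrogateNcrCheck and the sample documents are made up, and U+1F600 is used as an arbitrary astral character. A document that references the two surrogate values separately should be rejected, because surrogate code points are outside the XML Char production, while a single NCR for the code point should parse.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical check, not part of Xalan: does a given XML string parse?
final class SurrogateNcrCheck {

    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Well-formedness errors surface as SAXException (SAXParseException).
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+1F600 written the way the bug report describes: two surrogate NCRs.
        System.out.println(parses("<r>&#55357;&#56832;</r>"));
        // The same character as one NCR for the code point.
        System.out.println(parses("<r>&#128512;</r>"));
    }
}
```

The default error handler may also print a fatal-error line to standard error for the rejected document.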
