axis-java-dev mailing list archives

From "Steve Onorato (JIRA)" <axis-...@ws.apache.org>
Subject [jira] [Comment Edited] (AXIS-2908) Apache Axis fails to handle non Basic Multilingual Plane characters(U+10000 and above) while creating SOAP request
Date Thu, 18 Jun 2015 21:33:01 GMT

    [ https://issues.apache.org/jira/browse/AXIS-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592545#comment-14592545 ]

Steve Onorato edited comment on AXIS-2908 at 6/18/15 9:32 PM:
--------------------------------------------------------------

I also had this problem.  For example, 𩶘 (U+29D98) should be serialized to XML either as
the UTF-8 byte sequence 0xF0 0xA9 0xB6 0x98 or as the Numeric Character Reference &#x29d98;.
Unfortunately, the two UTF-16 surrogates are each converted directly to a reference of their own,
giving &#xD867;&#xDD98;, which is invalid under both the XML 1.0 and 1.1 specs.  As a result,
the XML parser that receives the invalid XML throws an exception.
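
To make the difference concrete, here is a minimal sketch in Java. It is not the Axis code: the class and method names are hypothetical, and the escaping rule (escape everything above 0x7F, ignore markup characters such as < and &) is an assumption chosen to keep the example short. It only contrasts escaping per UTF-16 char with escaping per Unicode code point.

```java
// Hypothetical illustration, not taken from Axis.
class NcrSketch {

    // Escapes each UTF-16 code unit separately. For a character outside
    // the BMP this emits one reference per surrogate, which XML forbids.
    static String escapePerChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                out.append("&#x").append(Integer.toHexString(c).toUpperCase()).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Escapes each Unicode code point. A surrogate pair is combined into
    // one code point first, so a single valid reference is emitted.
    static String escapePerCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 0x7F) {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x29D98));
        System.out.println(escapePerChar(s));      // &#xD867;&#xDD98;
        System.out.println(escapePerCodePoint(s)); // &#x29d98;
    }
}
```

The second method relies only on String.codePointAt and Character.charCount, both available since Java 5, so the same approach should work on the JDK 7 environment named in the issue.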

As a workaround, I applied the patch "AXIS_2342.diff" from https://issues.apache.org/jira/browse/AXIS-2342
- it solves this problem since it avoids the logic that causes the bad Numeric Character References
to be emitted.


was (Author: steveonorato):
I also had this problem.  For example, U+29D98 (see http://www.fileformat.info/info/unicode/char/29d98/index.htm)
should either be converted to the UTF-8 byte sequence 0xF0 0xA9 0xB6 0x98 or Numeric Character
Reference &#x29d98; when serialized to XML.
Unfortunately, the UTF-16 surrogates are getting directly converted to &#xD867;&#xDD98;
which is not valid according to both XML 1.0 and 1.1 specs.  As a result, the XML parser receiving
the invalid XML throws an exception.

As a workaround, I applied the patch "AXIS_2342.diff" from https://issues.apache.org/jira/browse/AXIS-2342
- it solves this problem since it avoids the logic that causes the bad Numeric Character References
to be emitted.

> Apache Axis fails to handle non Basic Multilingual Plane characters (U+10000 and above) while creating SOAP request
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: AXIS-2908
>                 URL: https://issues.apache.org/jira/browse/AXIS-2908
>             Project: Axis
>          Issue Type: Bug
>          Components: Serialization/Deserialization
>    Affects Versions: 1.4
>         Environment: OS - CentOS
> Software Platform - JDK 7
>            Reporter: Siddhesh Sundar Toraskar
>              Labels: charset, xml-rpc
>
> When creating a SOAP request that contains non-BMP characters (e.g. emoji), the characters
> are not inserted into the XML correctly.
> My content arrives as UTF-8 and is held in a Java String, which is UTF-16 internally, once
> the program receives it.
> The Apache Axis library that we use then takes those UTF-16 Java Strings and tries to convert
> them back into UTF-8 to create the XML before sending. This fails whenever I send an emoji
> (:grin:) that takes 4 bytes in UTF-8. I found that any character that takes 4 bytes in UTF-8
> is represented as a surrogate pair in UTF-16. I suspect that Axis does not recognize the
> surrogate pair and so cannot convert it into a valid UTF-8 encoding.
> As a result, although UTF-8 is specified, these emoji appear in UTF-16 form, which
> corrupts them because they are then processed incorrectly.
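
The relationship between the 4-byte UTF-8 form and the UTF-16 surrogate pair described above can be checked with a short Java sketch. The class name is hypothetical, and U+1F601 is assumed as a stand-in for the emoji in the report, which does not state the exact code point. The sketch uses only standard JDK calls (StandardCharsets is available from Java 7).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of how one non-BMP character is stored.
class SurrogateSketch {

    // Formats bytes as space-separated upper-case hex, e.g. "F0 9F".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed example code point outside the BMP.
        String grin = new String(Character.toChars(0x1F601));

        // One code point, but two UTF-16 chars (a surrogate pair).
        System.out.println(grin.length());                          // 2
        System.out.println(grin.codePointCount(0, grin.length()));  // 1
        System.out.println(Integer.toHexString(grin.charAt(0)));    // d83d
        System.out.println(Integer.toHexString(grin.charAt(1)));    // de01

        // The JDK encoder combines the pair into one 4-byte sequence.
        System.out.println(hex(grin.getBytes(StandardCharsets.UTF_8))); // F0 9F 98 81
    }
}
```

Since the standard encoder handles the pair correctly, the corruption presumably arises in a step that processes the string one char at a time before encoding, which is consistent with the reporter's suspicion.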



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
