xalan-dev mailing list archives

From "Steven J. Hathaway (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (XALANC-743) XalanOutputStream::transcode falls into infinite loop on 4 bytes unicode till out of memory
Date Mon, 29 Apr 2013 15:48:15 GMT

    [ https://issues.apache.org/jira/browse/XALANC-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644587#comment-13644587 ]

Steven J. Hathaway commented on XALANC-743:
-------------------------------------------

This issue should be for XERCESC instead of XALANC.  We use Xerces for the XML parser and
transcoder library.

I don't know how you were able to put a UTF-16 high-surrogate value at the end of a string
without truncation unless it was a misuse of UTF-8.

- - -

FYI: Unicode Discussion - The valid code points related to UTF-8/UTF-16

Unicode values are restricted to a subset of 21-bit binary values.

Special value 0x0000 is not a usable code point in this context: Unicode assigns it to NUL,
but XML does not allow it as a character.  This value may be used as a terminator
for code-point sequences.  This value should not be encoded as UTF-8 or UTF-16.

Surrogate values for UTF-16 are not valid code-point values.  They are used to translate pairs
of 16-bit quantities into code-point values.
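
To make the surrogate arithmetic concrete, here is a minimal sketch of how a high/low pair
maps to one code point. The helper names are hypothetical and are not part of Xalan-C or
Xerces-C; only the constants come from the UTF-16 definition.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only.
inline bool isHighSurrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(std::uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a surrogate pair into a single code point.
// Precondition: isHighSurrogate(hi) && isLowSurrogate(lo).
inline std::uint32_t combineSurrogates(std::uint16_t hi, std::uint16_t lo)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00u);
}
```

A high surrogate with no low surrogate after it carries only half of the information, which
is why it cannot be translated on its own.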

Values larger than 0x10FFFF are not valid code-point values.  This is the limit of the UTF-16
algorithm.

The UTF-8 algorithm can, in principle, encode 31 bits of binary value, but only values in
the range 0x01 to 0x10FFFF are valid Unicode code-point values.
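
As a rough illustration of these limits, the sketch below encodes one code point as UTF-8
and rejects the values listed above. The function name appendUtf8 is hypothetical, and
rejecting 0 is an assumption that follows the terminator convention described in this
message; it is not how any particular library behaves.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: appends the UTF-8 bytes for cp to out.
// Returns false (and writes nothing) for 0, surrogates, and values above 0x10FFFF.
inline bool appendUtf8(std::uint32_t cp, std::string& out)
{
    // Assumption: 0 is treated as a terminator and is never encoded.
    if (cp == 0 || cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        return false;

    if (cp < 0x80u) {                       // 1 byte: 7 bits
        out += static_cast<char>(cp);
    } else if (cp < 0x800u) {               // 2 bytes: 11 bits
        out += static_cast<char>(0xC0u | (cp >> 6));
        out += static_cast<char>(0x80u | (cp & 0x3Fu));
    } else if (cp < 0x10000u) {             // 3 bytes: 16 bits
        out += static_cast<char>(0xE0u | (cp >> 12));
        out += static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu));
        out += static_cast<char>(0x80u | (cp & 0x3Fu));
    } else {                                // 4 bytes: 21 bits
        out += static_cast<char>(0xF0u | (cp >> 18));
        out += static_cast<char>(0x80u | ((cp >> 12) & 0x3Fu));
        out += static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu));
        out += static_cast<char>(0x80u | (cp & 0x3Fu));
    }
    return true;
}
```

Four bytes are enough for every value up to 0x10FFFF, so the longer forms that the 31-bit
scheme allows are never needed for Unicode text.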

- - -
Sincerely,
Steven J. Hathaway
                
> XalanOutputStream::transcode falls into infinite loop on 4 bytes unicode till out of memory
> -------------------------------------------------------------------------------------------
>
>                 Key: XALANC-743
>                 URL: https://issues.apache.org/jira/browse/XALANC-743
>             Project: XalanC
>          Issue Type: Bug
>          Components: XalanC
>    Affects Versions: 1.10
>         Environment: Linux
>            Reporter: Jiangbei Fan
>            Assignee: Steven J. Hathaway
>
> In some rare cases, XalanTransformer::transform hangs or crashes when the input or
> stylesheet contains 4-byte Unicode characters. I traced the root cause to
> XalanOutputStream::transcode.
> When the transcode buffer contains 4-byte Unicode characters and the last XalanDOMChar
> in the buffer is the first 2 bytes of a 4-byte character, XalanOutputStream::transcode
> falls into an infinite loop until it runs out of memory. XMLUTF8Transcoder.cpp in Xerces
> does not consume the last 2 bytes if they are part of a 4-byte character, and transcode
> keeps looping until all characters in the buffer are consumed. Specifically, this happens
> when the last XalanDOMChar in the input buffer is between 0xD800 and 0xDBFF.
> I could not find whether this issue has been reported before. This is version 1.10. I
> have a fix that adds a bool reference parameter to the function, so that the caller can
> push the last 2 bytes back to the buffer if they were not consumed, but I want to check
> before submitting any fixes.
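
The condition described in the report (a buffer whose last 16-bit unit is in the range
0xD800 to 0xDBFF) can be detected before the buffer is handed to a transcoder. The sketch
below is only an illustration of that check: completeUnitCount is a hypothetical name, it
uses std::uint16_t in place of XalanDOMChar, and it is neither the reporter's proposed fix
nor code from Xalan-C or Xerces-C.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns how many leading units of buf[0..len) form
// complete characters as far as the buffer end is concerned. If the last
// unit is a high surrogate, it is held back so the caller can prepend it
// to the next buffer instead of retrying it forever.
inline std::size_t completeUnitCount(const std::uint16_t* buf, std::size_t len)
{
    if (len > 0 && buf[len - 1] >= 0xD800 && buf[len - 1] <= 0xDBFF)
        return len - 1;
    return len;
}
```

A caller would transcode only the first completeUnitCount(buf, len) units and carry the
remaining unit, if any, over to the next flush. A loop that also stops when a pass consumes
nothing would avoid spinning on input that never completes the pair.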

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@xalan.apache.org
For additional commands, e-mail: dev-help@xalan.apache.org

