xerces-j-dev mailing list archives

From "Ian Upright (JIRA)" <xerces-j-...@xml.apache.org>
Subject [jira] [Commented] (XERCESJ-1257) buffer overflow in UTF8Reader for characters out of BMP
Date Thu, 21 Jan 2016 17:24:39 GMT

    [ https://issues.apache.org/jira/browse/XERCESJ-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110954#comment-15110954 ]

Ian Upright commented on XERCESJ-1257:
--------------------------------------

For those using mwdumper to load Wikipedia or other dumps and running into this issue, the
change below seemed to fix it. It also serves as an example of how to work around the bug:
wrap the input stream in an InputStreamReader so that the JVM, rather than Xerces' own
UTF8Reader, decodes the UTF-8. However, it would be good to have the real issue addressed.
I would vote to modify Xerces to simply use the JVM to decode UTF-8, as Michael suggested.

        public void readDump() throws IOException {
                try {
                        SAXParserFactory factory = SAXParserFactory.newInstance();
                        SAXParser parser = factory.newSAXParser();
                        // Let the JVM decode UTF-8; the parser then receives chars,
                        // so Xerces' UTF8Reader is never used.
                        Reader reader = new InputStreamReader(input,"UTF-8");
                        InputSource is = new InputSource(reader);
                        // Informational only: ignored when a character stream is set.
                        is.setEncoding("UTF-8");
                        parser.parse(is, this);
                } catch (ParserConfigurationException e) {
                        throw (IOException)new IOException(e.getMessage()).initCause(e);
                } catch (SAXException e) {
                        throw (IOException)new IOException(e.getMessage()).initCause(e);
                }
                writer.close();
        }
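
For anyone who wants to try the same workaround outside mwdumper, here is a minimal
self-contained sketch. The class name ReaderWorkaroundDemo, the method parseText and the
sample document are made up for illustration and are not part of mwdumper or Xerces. It only
shows that a character outside the BMP survives the Reader-based path; it does not try to
reproduce the overflow itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names are illustrative only.
class ReaderWorkaroundDemo {

    /** Parses the stream through a JVM-decoded Reader and returns all character data. */
    static String parseText(InputStream in) throws IOException {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // The JVM decodes the bytes; the parser only ever sees chars.
            Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
            parser.parse(new InputSource(reader), handler);
        } catch (ParserConfigurationException | SAXException e) {
            // Same wrapping approach as readDump() above.
            throw new IOException(e.getMessage(), e);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // U+10400 lies outside the BMP: 4 bytes in UTF-8, a surrogate pair in Java.
        String supplementary = new String(Character.toChars(0x10400));
        String xml = "<page>abc" + supplementary + "</page>";
        String text = parseText(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println("code points read: " + text.codePointCount(0, text.length()));
    }
}
```

The trade-off is that the encoding is now fixed by the caller, so this only suits input
that is known to be UTF-8, as the Wikipedia dumps are.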


> buffer overflow in UTF8Reader for characters out of BMP
> -------------------------------------------------------
>
>                 Key: XERCESJ-1257
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1257
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: JAXP (javax.xml.parsers)
>    Affects Versions: 2.9.0
>         Environment: Any
>            Reporter: Robert Stojnic
>            Assignee: Michael Glavassevich
>            Priority: Minor
>         Attachments: TestXerces.java, UTF8Reader.patch, XERCESJ-1257_tests.patch
>
>
> There is an ArrayIndexOutOfBoundsException in org.apache.xerces.impl.io.UTF8Reader, in
> read(char[],int,int), for 4-byte UTF-8 chars.
> Imagine the following scenario. read() has a buffer of size N, and it reads N-1 ASCII chars
> and stores them in the output buffer. Let the Nth char be the first byte of a 4-byte UTF-8
> char. The other 3 bytes are fetched by invoking read() on the input stream. From these a
> surrogate pair of Java chars is made; however, the method does not check whether both chars
> can fit into the output buffer. In most cases they would fit into the output buffer (e.g. if
> there are some other multi-byte chars in the fetched text), so the bug is very rare, but it
> still happens.
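
To make the scenario in the quoted report concrete, here is a simplified decoder sketch with
the kind of bounds check the report says is missing. This is not the Xerces UTF8Reader code
and not the attached UTF8Reader.patch. The class SurrogateBoundsSketch and its decode method
are hypothetical, and the input is assumed to be well-formed UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not taken from Xerces.
class SurrogateBoundsSketch {

    /**
     * Decodes well-formed UTF-8 from src, starting at srcPos[0], into
     * out[offset .. offset+length). Returns the number of chars written
     * and advances srcPos[0] past the bytes consumed.
     */
    static int decode(byte[] src, int[] srcPos, char[] out, int offset, int length) {
        int written = 0;
        int pos = srcPos[0];
        while (pos < src.length && written < length) {
            int b0 = src[pos] & 0xFF;
            if (b0 < 0x80) {                       // 1 byte  -> 1 char
                out[offset + written++] = (char) b0;
                pos += 1;
            } else if ((b0 & 0xE0) == 0xC0) {      // 2 bytes -> 1 char
                out[offset + written++] =
                        (char) (((b0 & 0x1F) << 6) | (src[pos + 1] & 0x3F));
                pos += 2;
            } else if ((b0 & 0xF0) == 0xE0) {      // 3 bytes -> 1 char
                out[offset + written++] = (char) (((b0 & 0x0F) << 12)
                        | ((src[pos + 1] & 0x3F) << 6) | (src[pos + 2] & 0x3F));
                pos += 3;
            } else {                               // 4 bytes -> 2 chars
                // The check the report describes as missing: a surrogate
                // pair needs two free slots, not one.
                if (length - written < 2) {
                    break;
                }
                int cp = ((b0 & 0x07) << 18) | ((src[pos + 1] & 0x3F) << 12)
                        | ((src[pos + 2] & 0x3F) << 6) | (src[pos + 3] & 0x3F);
                out[offset + written++] = Character.highSurrogate(cp);
                out[offset + written++] = Character.lowSurrogate(cp);
                pos += 4;
            }
        }
        srcPos[0] = pos;
        return written;
    }

    public static void main(String[] args) {
        // N-1 ASCII chars followed by one 4-byte character, with N = 4.
        String s = "abc" + new String(Character.toChars(0x10400));
        byte[] src = s.getBytes(StandardCharsets.UTF_8);
        int[] pos = {0};
        char[] out = new char[4];
        StringBuilder sb = new StringBuilder();
        while (pos[0] < src.length) {
            sb.append(out, 0, decode(src, pos, out, 0, out.length));
        }
        System.out.println(sb.toString().equals(s));
    }
}
```

Stopping early keeps the sketch short, but it returns 0 when the caller's buffer has room for
only one char, so a real reader would more likely have to hold back the low surrogate and
deliver it on the next read() call.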



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: j-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-dev-help@xerces.apache.org

