commons-dev mailing list archives

From Gary Gregory <garydgreg...@gmail.com>
Subject Re: [IO] BOMInputStream bug?
Date Fri, 10 Aug 2012 22:50:32 GMT
On Fri, Aug 10, 2012 at 4:27 PM, Niall Pemberton <niall.pemberton@gmail.com> wrote:

> On Fri, Aug 10, 2012 at 6:44 PM, Gary Gregory <garydgregory@gmail.com> wrote:
> > Hi All:
> >
> > Does anyone have expertise with BOMInputStream?
> >
> > I know that some XML parsers (like the one shipped with the Oracle JRE) do
> > not detect UTF-32 BOMs (UTF-8 and UTF-16 BOMs are OK), but using
> > BOMInputStream is supposed to fix the issue.
> >
> > These tests I added and @Ignore'd fail:
> >
> >    - org.apache.commons.io.input.BOMInputStreamTest.testReadXmlWithBOMUtf32Be()
> >    - org.apache.commons.io.input.BOMInputStreamTest.testReadXmlWithBOMUtf32Le()
> >
> > More basic tests do work:
> >
> >    - org.apache.commons.io.input.BOMInputStreamTest.testReadWithBOMUtf32Be()
> >    - org.apache.commons.io.input.BOMInputStreamTest.testReadWithBOMUtf32Le()
> >
> > When I look at the Oracle JRE (which uses a copy of Xerces), I see code to
> > deal with UCS-4, which is a precursor to UTF-32, just as UCS-2 is a subset
> > of UTF-16, but as the tests show, Xerces fails to parse a UTF-32 document.
> >
> > Any thoughts?
>
> Hi Gary,
>
> I enabled the tests and ran them. I'm a bit confused about what the
> issue is, because the lines that use the BOMInputStream to *skip* the
> UTF-32 BOM do not fail for me:
>
>         parseXml(new BOMInputStream(createUtf32BeDataStream(data, true),
>                 ByteOrderMark.UTF_32BE));
>         parseXml(new BOMInputStream(createUtf32LeDataStream(data, true),
>                 ByteOrderMark.UTF_32LE));
>
> whereas the lines after those, which do not use any Commons IO components,
> fail:
>
>         parseXml(createUtf32BeDataStream(data, true));
>         parseXml(createUtf32LeDataStream(data, true));
>
> So this just means that the XML parser doesn't deal with a UTF-32 BOM.
>
> Really, though, BOMInputStream doesn't provide anything that helps
> parse the XML properly. It has two purposes: 1) BOM detection and
> 2) BOM removal/skipping.
>
> What we do have in Commons is XMLInputStream. It uses various
> techniques to detect the encoding, including using BOMInputStream to try
> BOM detection, and then uses that encoding with a Reader to process
> the bytes properly.
>

Do you mean XmlStreamReader?
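
If so, the idea you describe (detect the BOM, skip it, then decode
through a Reader with the matching charset) could be sketched roughly as
below. This is a JDK-only illustration of the technique and not the
Commons IO implementation; the class name BomSniffer and its open()
method are made up for the example, and it only looks at BOMs, not at
the XML declaration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper, for illustration only (not part of Commons IO).
final class BomSniffer {

    // Longest BOMs first, so UTF-32LE (FF FE 00 00) is tried before UTF-16LE (FF FE).
    private static final String[] NAMES = {
        "UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"
    };
    private static final byte[][] BOMS = {
        {0x00, 0x00, (byte) 0xFE, (byte) 0xFF},
        {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00},
        {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF},
        {(byte) 0xFE, (byte) 0xFF},
        {(byte) 0xFF, (byte) 0xFE},
    };

    private BomSniffer() {
    }

    /**
     * Returns a Reader that decodes the stream with the charset implied by
     * its BOM (the BOM itself is skipped), or with the fallback charset
     * when no known BOM is present.
     */
    static Reader open(InputStream in, String fallbackCharset) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 4);
        byte[] head = new byte[4];

        // Read up to 4 bytes; short streams simply yield fewer.
        int n = 0;
        while (n < head.length) {
            int r = pin.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        for (int i = 0; i < BOMS.length; i++) {
            byte[] bom = BOMS[i];
            if (n >= bom.length && Arrays.equals(Arrays.copyOf(head, bom.length), bom)) {
                // Push back only the bytes that follow the BOM.
                pin.unread(head, bom.length, n - bom.length);
                return new InputStreamReader(pin, Charset.forName(NAMES[i]));
            }
        }

        // No BOM found: push everything back and use the fallback.
        pin.unread(head, 0, n);
        return new InputStreamReader(pin, Charset.forName(fallbackCharset));
    }
}
```

A parser could then be handed the Reader (for example via an
org.xml.sax.InputSource) instead of the raw byte stream, so that it
never has to interpret the UTF-32 BOM itself.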

Gary

>
> Niall
>
> > Thank you,
> > Gary
> >
> > --
> > E-Mail: garydgregory@gmail.com | ggregory@apache.org
> > JUnit in Action, 2nd Ed: http://bit.ly/ECvg0
> > Spring Batch in Action: <http://s.apache.org/HOq>http://bit.ly/bqpbCK
> > Blog: http://garygregory.wordpress.com
> > Home: http://garygregory.com/
> > Tweet! http://twitter.com/GaryGregory