xml-general mailing list archives

From George Armhold <armh...@cs.rutgers.edu>
Subject bug in Crimson: parsing UTF-8 chars in DTD comment fields
Date Thu, 06 Sep 2001 18:15:44 GMT
Hi,

I'd like to report what I think is a bug in Crimson (as obtained with
Sun's JAXP 1.1 reference implementation).  I'm fairly new to XML, and
I may be off-base here, so please bear with me.  I'm trying to parse a
MusicXML document (see http://www.musicxml.org) and Crimson is giving
me the following error:

org.xml.sax.SAXParseException: Character conversion error: "Illegal ASCII character, 0xc2" (line number may be too low).

when it encounters DTDs that have UTF-8 encoded characters in their
comments.  In the case of MusicXML, the character is a two-byte
copyright symbol: © (bytes 0xC2 0xA9 in UTF-8).  I believe that this
is correct UTF-8, and that it should be parsed correctly.  MusicXML is
a complex hierarchy of DTDs, so I've boiled it down to a simple example
which I think demonstrates the problem.  An example document:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE simple PUBLIC
        "-//Armhold//Simple DTD//EN"
        "http://pablo.rutgers.edu/~armhold/dtds/simple.dtd">
    
    <simple>
    </simple>

The referenced simple.dtd contains the following:

    <?xml version="1.0" encoding="UTF-8"?>
    <!--
          A really simple DTD.
          Copyright © 2000-2001.
    -->
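
As a sanity check on the byte reported in the error message, here is a
small sketch (not part of the test case itself) that prints the UTF-8
bytes of the copyright sign.  The class and method names are made up
for illustration, and it assumes a current JDK, since it uses
java.nio.charset.StandardCharsets, which JDK 1.3 does not have.

```java
import java.nio.charset.StandardCharsets;

class CopyrightBytes {
    // Returns the UTF-8 encoding of s as lowercase hex, two digits per byte.
    static String utf8Hex(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte raw : s.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xff;               // treat the byte as unsigned
            if (b < 0x10) {
                hex.append('0');              // pad to two hex digits
            }
            hex.append(Integer.toHexString(b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // \u00a9 is the copyright sign, escaped so that the encoding of
        // this source file does not matter.
        System.out.println(utf8Hex("\u00a9"));   // prints c2a9
    }
}
```

The first byte, 0xc2, is above 0x7f and so is not a valid 7-bit ASCII
character, which matches the byte named in the exception.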

When I try to parse this with Sun's example parser
(http://java.sun.com/xml/jaxp-1.1/docs/tutorial/dom/work/DomEcho01.java)
I get the following:

org.xml.sax.SAXParseException: Character conversion error: "Illegal ASCII character, 0xc2" (line number may be too low).
        at org.apache.crimson.parser.InputEntity.fatal(InputEntity.java:1038)
        at org.apache.crimson.parser.InputEntity.fillbuf(InputEntity.java:1010)
        at org.apache.crimson.parser.InputEntity.peek(InputEntity.java:841)
        at org.apache.crimson.parser.Parser2.peek(Parser2.java:3000)
        at org.apache.crimson.parser.Parser2.maybeTextDecl(Parser2.java:2725)
        at org.apache.crimson.parser.Parser2.externalParameterEntity(Parser2.java:2806)
        at org.apache.crimson.parser.Parser2.maybeDoctypeDecl(Parser2.java:1155)
        at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:489)
        at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
        at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:433)
        at org.apache.crimson.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:185)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:161)
        at DomEcho01.main(DomEcho01.java:63)


Removing the copyright character from my DTD solves the problem.  I'm
using JDK 1.3.0 with JAXP 1.1.  Can someone please confirm this as a
bug, or enlighten me as to what I'm doing wrong?
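
For anyone who wants to try this without fetching the DTD over the
network, the following is a minimal sketch of the same test case using
only standard JAXP and SAX calls.  It is not what I actually ran (that
was DomEcho01 against the URL above): the class name, the method name,
and the EntityResolver that serves the DTD from memory are assumptions
added for illustration.  Serving the DTD from a byte stream may not
exercise the same code path in Crimson as an HTTP fetch, and the
sketch uses java.nio.charset.StandardCharsets, so it needs a newer JDK
than 1.3.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class DtdCommentRepro {
    // The example document from this message.
    static final String DOC =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE simple PUBLIC\n"
        + "    \"-//Armhold//Simple DTD//EN\"\n"
        + "    \"http://pablo.rutgers.edu/~armhold/dtds/simple.dtd\">\n"
        + "\n"
        + "<simple>\n"
        + "</simple>\n";

    // The contents of simple.dtd; \u00a9 is the copyright sign.
    static final String DTD =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!--\n"
        + "      A really simple DTD.\n"
        + "      Copyright \u00a9 2000-2001.\n"
        + "-->\n";

    // Parses DOC, resolving the external DTD from memory, and returns
    // the name of the root element if parsing succeeds.
    static String parseRootName() throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                // Hand the parser raw UTF-8 bytes so that it has to
                // decode the two-byte copyright sign itself.
                return new InputSource(new ByteArrayInputStream(
                    DTD.getBytes(StandardCharsets.UTF_8)));
            }
        });
        Document doc = builder.parse(new InputSource(
            new ByteArrayInputStream(DOC.getBytes(StandardCharsets.UTF_8))));
        return doc.getDocumentElement().getNodeName();
    }

    public static void main(String[] args) throws Exception {
        // A parser that decodes the DTD as UTF-8 should get this far;
        // a parser with the problem described above would throw instead.
        System.out.println("Parsed root element: " + parseRootName());
    }
}
```

If the parser honors the encoding given in the DTD's text declaration,
the comment should be skipped and the root element name returned.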

Thanks

--
George Armhold
Rutgers University
Bioinformatics Initiative


