xerces-j-dev mailing list archives

From Michael Glavassevich <mrgla...@engmail.uwaterloo.ca>
Subject Re: XML 1.1 end-of-line oddities and patches
Date Wed, 02 Apr 2003 01:34:25 GMT
Hi Neil,

I think it may be a little more complicated than just starting the version
detector with the 1.0 scanner. Right now the version detector reads the XML
decl up to the end quote after the version attribute value. As Glenn
pointed out, you still don't know the encoding yet, so when the document
scanner gets the XML decl, it still cannot reliably detect NEL or LSEP. And
if only ASCII characters are permitted in the XML decl, the 1.1 scanner
shouldn't be used at all while scanning the XML decl.

Just for some perspective of what other people are doing, the implementor
of RXP (a C parser) rejects NEL and LSEP anywhere in the XML decl.

http://lists.w3.org/Archives/Public/www-xml-blueberry-comments/2003Feb/0001.html

At 07:47 PM 01/04/2003 -0500, you wrote:
>Hi Michael,
>
>Nicely done!  The attribute value patch looks fine to me; you're also right
>that a grep through the code needs to be undertaken to find other 0x85 0xA
>slips.  I'm also unsure that XML11Chars.isXML11Space(int) needs to remain
>with us, but there's one use of that method in the XML11EntityScanner that
>looks vaguely appropriate; so perhaps it should survive.
>
>I agree with you that the behaviour with respect to spaces in the XML decl
>could use some clarifying; either way you're right about there being a bug
>here.  I'll think about how to fix this; in the very likely event that no
>XML 1.1 newline normalization need be performed here I think we can get
>away with a simpler fix:  simply start the version detector with a 1.0
>scanner; it can happily throw all the whitespace characters it sees away
>and no one need be the wiser.
>
>Cheers,
>Neil
>Neil Graham
>XML Parser Development
>IBM Toronto Lab
>Phone:  905-413-3519, T/L 969-3519
>E-mail:  neilg@ca.ibm.com
>
>From:     Michael Glavassevich <mrglavas@engmail.uwaterloo.ca>
>Date:     04/01/2003 12:30 AM
>Please respond to xerces-j-dev
>To:       xerces-j-dev@xml.apache.org
>Subject:  XML 1.1 end-of-line oddities and patches
>
>Hi everyone,
>
>I've spent some time looking at the XML 1.1 candidate rec and I've noticed
>a number of areas within Xerces which seem to behave contrary to the spec.
>The issues that I'm bringing up assume the following:
>
>1) NEL (Unicode 0x85) and LSEP (Unicode 0x2028) are not white space
>characters, because the 1.1 spec does not modify the S production
>(http://www.w3.org/TR/REC-xml#NT-S) from XML 1.0.
>
>2) The amendment to 2.11 End-Of-Line Handling
>(http://www.w3.org/TR/xml11/#sec2.11) means that it's possible for 0x85 and
>0x2028 to occur in the XML declaration before the version of the document
>is determined to be 1.1.
>
>3) There's some way to force non-normalized 0x85 and 0x2028 into a document
>using references to parameter/general entities, such that they can appear
>in replacement text as part of markup in places where they're not allowed
>(since they're not white space). For example, between an element name and
>attribute list.
>
>4) Section 3.3.3 Attribute-Value Normalization
>(http://www.w3.org/TR/REC-xml#AVNormalize) from XML 1.0 is unchanged, so
>even assuming that 0x85 and 0x2028 are white space, they shouldn't be
>replaced with the 0x20 space character.
>
>In general it looks like Xerces is treating 0x85 and 0x2028 as if they were
>white space everywhere, not just in the case described by section 2.11,
>meaning if my third and/or fourth assumptions are correct there are cases
>where these characters are going to be handled incorrectly. I've attached
>some patches to this e-mail for some specific problems I've located.
>
>Patch #1: version-detector
>The parser allows 0x85 and 0x2028 to appear in the XML declaration before
>it determines that the version of the document is 1.0. Since end-of-line
>handling for XML 1.0 documents doesn't include these characters, such
>documents must be invalid. XMLVersionDetector consumes some of the input
>stream, so it needs to do some clever fixup of the entity before the document
>scanner gets hold of it. Unfortunately, once the scanner gets the document
>entity, any trace of 0x85 and 0x2028 is gone because they were quietly
>normalized away. This makes it impossible for the 1.0 document scanner to
>detect 0x85 or 0x2028 in the XML declaration. It looks like this
>detection must be done in XMLVersionDetector.
>
>My fix first assumes 1.0 end-of-line handling, and then switches to 1.1
>end-of-line handling in order to try to match the segment of the XML
>declaration production '<?xml' S 'version' S? '=' S?. You can use this
>approach to indirectly determine if 0x85 or 0x2028 appears in this part of
>the document, and then emit an error if it's determined that the document
>wasn't version 1.1.
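
A minimal sketch of the detection idea described above, not the actual patch (which was removed from this note). It assumes the input is already decoded into a Java String and begins with an XML declaration; the class name XmlDeclVersionSketch and the method detectVersion are hypothetical and are not part of Xerces. It matches '<?xml' S 'version' S? '=' S? while tolerating 0x85 and 0x2028 as if they were line ends, remembers whether either was seen, and rejects the document once the version turns out not to be 1.1.

```java
final class XmlDeclVersionSketch {
    private static final char NEL = 0x85;
    private static final char LSEP = 0x2028;

    private final String in;
    private int pos;
    private boolean sawNelOrLsep;

    private XmlDeclVersionSketch(String in) {
        this.in = in;
    }

    // Skips XML 1.0 white space plus NEL/LSEP, recording whether NEL/LSEP occurred.
    // Returns the number of characters skipped.
    private int skipSpace() {
        int start = pos;
        while (pos < in.length()) {
            char c = in.charAt(pos);
            if (c == NEL || c == LSEP) {
                sawNelOrLsep = true;
            } else if (c != 0x20 && c != 0x9 && c != 0xD && c != 0xA) {
                break;
            }
            pos++;
        }
        return pos - start;
    }

    // Consumes the literal if it occurs at the current position.
    private boolean expect(String literal) {
        if (in.startsWith(literal, pos)) {
            pos += literal.length();
            return true;
        }
        return false;
    }

    // Returns the declared version, or throws if the prefix is malformed or if
    // NEL/LSEP appeared before the version in a document that is not 1.1.
    static String detectVersion(String text) {
        XmlDeclVersionSketch s = new XmlDeclVersionSketch(text);
        if (!s.expect("<?xml") || s.skipSpace() == 0 || !s.expect("version")) {
            throw new IllegalArgumentException("no XML declaration with a version");
        }
        s.skipSpace();
        if (!s.expect("=")) {
            throw new IllegalArgumentException("expected '=' after 'version'");
        }
        s.skipSpace();
        if (s.pos >= text.length()) {
            throw new IllegalArgumentException("missing version value");
        }
        char quote = text.charAt(s.pos);
        if (quote != '"' && quote != '\'') {
            throw new IllegalArgumentException("version value must be quoted");
        }
        int end = text.indexOf(quote, s.pos + 1);
        if (end < 0) {
            throw new IllegalArgumentException("unterminated version value");
        }
        String version = text.substring(s.pos + 1, end);
        if (s.sawNelOrLsep && !"1.1".equals(version)) {
            throw new IllegalArgumentException(
                "0x85 or 0x2028 in the XML declaration of a non-1.1 document");
        }
        return version;
    }
}
```

The sketch sidesteps the encoding question raised elsewhere in this thread: a real detector works on bytes whose encoding is not yet known at this point.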
>
>Patch #2: attribute-normalization
>The parser will replace 0x85 and 0x2028 with 0x20 when normalizing
>attributes. As per my fourth assumption this isn't legal. My patch reverts
>white space replacement to the behaviour in the 1.0 scanner (just
>replacement of 0x20, 0xD, 0xA, 0x9 with 0x20).
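
The white space replacement step described above could be sketched as follows. This covers only the character replacement part of attribute-value normalization (no reference expansion, no collapsing for tokenized types), and the class and method names are hypothetical, not taken from Xerces or from the patch.

```java
final class AttrWhitespaceSketch {
    // Replaces 0x9, 0xA and 0xD with 0x20 (0x20 maps to itself), following the
    // XML 1.0 rule. 0x85 and 0x2028 are deliberately left untouched.
    static String replaceWhitespace(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            out.append((c == 0x9 || c == 0xA || c == 0xD) ? ' ' : c);
        }
        return out.toString();
    }
}
```

In a 1.1 document, a literal 0x85 or 0x2028 in the source would normally have been turned into 0xA by end-of-line handling before this step runs, so the characters left alone here are the ones that arrive by other routes, such as character references.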
>
>Patch #3: D-85-newline-normalization
>It seems like the character sequence #x85 #xA is being normalized to #xA
>instead of normalizing #xD #x85 to #xA. My patch is just an example for
>scanChar (also fixes normalization of 0x2028). Something similar can be
>done in other places in the scanner.
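
For reference, the XML 1.1 end-of-line rules in section 2.11 translate #xD #xA, #xD #x85, a lone #x85, #x2028, and any other #xD into a single #xA. A minimal sketch over a decoded String, with hypothetical names and without the buffer handling a real scanner such as scanChar needs:

```java
final class Xml11LineEndSketch {
    // Normalizes XML 1.1 line ends to 0xA. Only 0xD can start a two-character
    // line end, so 0x85 followed by 0xA yields two line feeds, not one.
    static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == 0xD) {
                // Swallow the second character of #xD #xA or #xD #x85.
                if (i + 1 < s.length()) {
                    char next = s.charAt(i + 1);
                    if (next == 0xA || next == 0x85) {
                        i++;
                    }
                }
                out.append('\n');
            } else if (c == 0x85 || c == 0x2028) {
                out.append('\n');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this ordering, the sequence #x85 #xA from the bug report comes out as two line feeds, while #xD #x85 comes out as one.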
>
>-----------------------------
>Michael Glavassevich
>mrglavas@engmail.uwaterloo.ca
>4B Computer Engineering
>University of Waterloo
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
>For additional commands, e-mail: xerces-j-dev-help@xml.apache.org
>
>#### D-85-newline-normalization-patch.txt has been removed from this note
>on April 01 2003 by Neil Graham
>#### version-detector-patch.txt has been removed from this note on April 01
>2003 by Neil Graham
>#### attribute-normalization-patch.txt has been removed from this note on
>April 01 2003 by Neil Graham
>

-----------------------------
Michael Glavassevich
mrglavas@engmail.uwaterloo.ca
4B Computer Engineering
University of Waterloo
