xerces-j-dev mailing list archives

From: "Neil Graham" <ne...@ca.ibm.com>
Subject: Re: XML 1.1 end-of-line oddities and patches
Date: Wed, 02 Apr 2003 00:47:56 GMT

Hi Michael,

Nicely done!  The attribute value patch looks fine to me; you're also right
that a grep through the code needs to be undertaken to find other 0x85 0xA
slips.  I'm also unsure that XML11Chars.isXML11Space(int) needs to remain
with us, but there's one use of that method in the XML11EntityScanner that
looks vaguely appropriate; so perhaps it should survive.

I agree with you that the behaviour with respect to spaces in the XML decl
could use some clarifying; either way you're right about there being a bug
here.  I'll think about how to fix this; in the very likely event that no
XML 1.1 newline normalization need be performed here I think we can get
away with a simpler fix:  simply start the version detector with a 1.0
scanner; it can happily throw all the whitespace characters it sees away
and no one need be the wiser.

Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  neilg@ca.ibm.com

Michael Glavassevich <mrglavas@engmail.uwaterloo.ca>
04/01/2003 12:30 AM
Please respond to xerces-j-dev

To:       xerces-j-dev@xml.apache.org
cc:
Subject:  XML 1.1 end-of-line oddities and patches

Hi everyone,

I've spent some time looking at the XML 1.1 candidate rec and I've noticed
a number of areas within Xerces which seem to behave contrary to the spec.
The issues that I'm bringing up assume the following:

1) NEL (Unicode 0x85) and LSEP (Unicode 0x2028) are not white space
characters, because the 1.1 spec does not modify the S production
(http://www.w3.org/TR/REC-xml#NT-S), which is therefore unchanged from
XML 1.0.

2) The amendment to 2.11 End-Of-Line Handling
(http://www.w3.org/TR/xml11/#sec2.11) means that it's possible for 0x85 and
0x2028 to occur in the XML declaration before the version of the document
is determined to be 1.1.

3) There's some way to force non-normalized 0x85 and 0x2028 into a document
using references to parameter/general entities, such that they can appear
in replacement text as part of markup in places where they're not allowed
(since they're not white space). For example, between an element name and
attribute list.

4) Section 3.3.3 Attribute-Value Normalization
(http://www.w3.org/TR/REC-xml#AVNormalize) from XML 1.0 is unchanged, so
even assuming that 0x85 and 0x2028 are white space, they shouldn't be
replaced with the 0x20 space character.

In general it looks like Xerces is treating 0x85 and 0x2028 as if they were
white space everywhere, not just in the case described by section 2.11,
meaning if my third and/or fourth assumptions are correct there are cases
where these characters are going to be handled incorrectly. I've attached
some patches to this e-mail for some specific problems I've located.

Patch #1: version-detector
The parser allows 0x85 and 0x2028 to appear in the XML declaration before
it determines that the version of the document is 1.0. Since end-of-line
handling for XML 1.0 documents doesn't include these characters, such
documents must be invalid. XMLVersionDetector consumes some of the input
stream, so it needs to do some clever fixup of the entity before the
document scanner gets hold of it. Unfortunately, once the scanner gets the
document entity, any trace of 0x85 and 0x2028 is gone because they were
quietly normalized away. This makes it impossible for the 1.0 document
scanner to detect 0x85 or 0x2028 in the XML declaration. It looks like this
detection must be done in XMLVersionDetector.

My fix first assumes 1.0 end-of-line handling, and then switches to 1.1
end-of-line handling in order to try to match the segment of the XML
declaration production '<?xml' S 'version' S? '=' S?. You can use this
approach to indirectly determine if 0x85 or 0x2028 appears in this part of
the document, and then emit an error if it's determined that the document
wasn't version 1.1.
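
To make the idea concrete, here is a minimal standalone sketch. It is not
the attached patch and does not use any Xerces classes; the class
VersionDetectorSketch and its methods are hypothetical names, and it works
on a String rather than on the entity scanner. It matches '<?xml' S
'version' S? '=' S? while remembering whether 0x85 or 0x2028 was skipped,
and reports an error if the declared version turns out not to be 1.1.

```java
final class VersionDetectorSketch {
    private final String doc;
    private int pos;
    private boolean sawXml11OnlyEol; // set when 0x85 or 0x2028 is skipped

    private VersionDetectorSketch(String doc) { this.doc = doc; }

    // Skips XML 1.0 white space plus the two XML 1.1 end-of-line
    // characters, and returns how many characters were skipped.
    private int skipSpaces() {
        int start = pos;
        while (pos < doc.length()) {
            char c = doc.charAt(pos);
            if (c == 0x85 || c == 0x2028) {
                sawXml11OnlyEol = true;
            } else if (c != 0x20 && c != 0x9 && c != 0xA && c != 0xD) {
                break;
            }
            pos++;
        }
        return pos - start;
    }

    private boolean skipString(String s) {
        if (doc.startsWith(s, pos)) { pos += s.length(); return true; }
        return false;
    }

    // Returns the declared version, or null if the input does not start
    // with '<?xml' S 'version' S? '=' S? followed by a quoted value.
    static String detect(String doc) {
        VersionDetectorSketch d = new VersionDetectorSketch(doc);
        if (!d.skipString("<?xml") || d.skipSpaces() == 0
                || !d.skipString("version")) return null;
        d.skipSpaces();
        if (!d.skipString("=")) return null;
        d.skipSpaces();
        if (d.pos >= doc.length()) return null;
        char quote = doc.charAt(d.pos);
        if (quote != '"' && quote != '\'') return null;
        int end = doc.indexOf(quote, d.pos + 1);
        if (end < 0) return null;
        String version = doc.substring(d.pos + 1, end);
        if (d.sawXml11OnlyEol && !version.equals("1.1")) {
            throw new IllegalStateException(
                "0x85 or 0x2028 in the XML declaration of a non-1.1 document");
        }
        return version;
    }
}
```

For example, VersionDetectorSketch.detect("<?xml\u0085version=\"1.1\"?>")
returns "1.1", while the same input declaring version "1.0" throws.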

Patch #2: attribute-normalization
The parser will replace 0x85 and 0x2028 with 0x20 when normalizing
attributes. As per my fourth assumption this isn't legal. My patch reverts
white space replacement to the behaviour in the 1.0 scanner (just
replacement of 0x20, 0xD, 0xA, 0x9 with 0x20).
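
As a rough illustration of the intended behaviour (again a standalone
sketch with hypothetical names, not the patch itself), the white space
replacement step only touches the four XML 1.0 white space characters and
leaves 0x85 and 0x2028 alone. It assumes end-of-line normalization has
already happened and ignores character and entity references.

```java
final class AttrWhitespaceSketch {
    // The S production from XML 1.0: space, tab, line feed, carriage return.
    static boolean isXml10Space(char c) {
        return c == 0x20 || c == 0x9 || c == 0xA || c == 0xD;
    }

    // Replaces each XML 1.0 white space character with a single 0x20.
    // 0x85 and 0x2028 are copied through unchanged.
    static String replaceWhitespace(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            out.append(isXml10Space(c) ? ' ' : c);
        }
        return out.toString();
    }
}
```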

Patch #3: D-85-newline-normalization
It seems like the character sequence #x85 #xA is being normalized to #xA
instead of normalizing #xD #x85 to #xA. My patch is just an example for
scanChar (also fixes normalization of 0x2028). Something similar can be
done in other places in the scanner.
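
For reference, this is how I read the XML 1.1 end-of-line rules in section
2.11, written as a standalone sketch over a String (the class and method
names are hypothetical and this is not the scanChar patch): #xD #xA,
#xD #x85, a lone #x85, a lone #x2028, and any other #xD each become a
single #xA.

```java
final class Xml11EolSketch {
    private static final char NEL = 0x85;
    private static final char LSEP = 0x2028;

    static String normalize(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == '\r') {
                // #xD #xA and #xD #x85 collapse to one #xA; so does a lone #xD.
                if (i + 1 < in.length()) {
                    char next = in.charAt(i + 1);
                    if (next == '\n' || next == NEL) {
                        i++;
                    }
                }
                out.append('\n');
            } else if (c == NEL || c == LSEP) {
                // A lone #x85 or #x2028 becomes #xA.
                out.append('\n');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Note that under these rules #x85 #xA is not a special pair: the #x85
becomes #xA on its own and the following #xA is kept, giving two line
feeds.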

Michael Glavassevich
4B Computer Engineering
University of Waterloo

To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org

#### D-85-newline-normalization-patch.txt has been removed from this note
on April 01 2003 by Neil Graham
#### version-detector-patch.txt has been removed from this note on April 01
2003 by Neil Graham
#### attribute-normalization-patch.txt has been removed from this note on
April 01 2003 by Neil Graham
