axis-c-user mailing list archives

From "Stadelmann Josef" <josef.stadelm...@axa-winterthur.ch>
Subject AW: axutil_xml_quote_string and apostrophes
Date Tue, 22 Feb 2011 09:27:07 GMT
Yes, there is in fact a reason for that. The restriction comes from the XML standard, http://www.w3.org/TR/2008/REC-xml-20081126/#NT-Name
, which explains how certain characters are used.

Less-than ("<") is used in a stream to open a tag, greater-than (">") closes it, and double
quotes (or apostrophes) delimit attribute values.

 

In short, and maybe oversimplified:

Suppose the parser has to read

<statement>40 is < than 70</statement>

 

The parser has a problem. After reading the opening tag <statement>, it looks for the "<"
that starts the closing tag. Once it finds one, it expects the next character of the closing
tag.

It finds the "<" and expects a "/" next but, as it does not find one, it has to report
an error.
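
The failure described above can be reproduced with a deliberately naive scanner. This is only a sketch to illustrate the idea, not how Axis2/C or any real XML parser works. The function name first_stray_lt is hypothetical, and the code assumes a text-only element without attributes or child elements.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: scans a text-only element such as
 * <statement>...</statement>. Returns the offset of the first '<' that
 * does not start the expected closing tag, -1 if the element is fine,
 * or -2 if the opening or closing tag is missing.
 */
long first_stray_lt(const char *xml, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = xml;

    /* Expect "<name>" at the very beginning. */
    if (p[0] != '<' || strncmp(p + 1, name, name_len) != 0 ||
        p[1 + name_len] != '>')
        return -2;
    p += name_len + 2;

    for (; *p != '\0'; ++p) {
        if (*p != '<')
            continue;               /* ordinary character data */
        /* A '<' was found: the only thing accepted here is "</name>". */
        if (p[1] == '/' && strncmp(p + 2, name, name_len) == 0 &&
            p[2 + name_len] == '>')
            return -1;              /* proper closing tag */
        return (long)(p - xml);     /* '<' used as data: error */
    }
    return -2;                      /* input ended without a closing tag */
}
```

Called with the statement element from the example and the name "statement", the function reports the position of the bare "<". With the same text written as "&lt;" it reaches the closing tag and returns -1.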

 

As a consequence, "<" can't be used as data between the opening and closing tags. It
can, however, be transmitted in escaped form.

 

I suggest you read a bit of http://www.w3.org/TR/2008/REC-xml-20081126/#NT-Name
to get a better understanding of why certain characters can be used in text and others cannot.
There are, however, ways to send '<', '>' or '"' in an XML stream: they have to be
escaped.

 

Also read http://en.wikipedia.org/wiki/Character_encoding , because you should never forget
about the encoding in use when your own code parses or writes XML documents.

 

 

Excerpt from the document linked above (read the red bold text below first):

 

2.2 Characters

[Definition: A parsed entity contains text, a sequence of characters <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-character>,
which may represent markup or character data.] [Definition: A character is an atomic unit
of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646] <http://www.w3.org/TR/2008/REC-xml-20081126/#ISO10646>.
Legal characters are tab, carriage return, line feed, and the legal characters of Unicode
and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References <http://www.w3.org/TR/2008/REC-xml-20081126/#sec-existing-stds>
were current at the time this document was prepared. New characters may be added to these
standards by amendments or new editions. Consequently, XML processors MUST accept any character
in the range specified for Char <http://www.w3.org/TR/2008/REC-xml-20081126/#NT-Char>.]

Character Range

[2]   Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
      /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
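
Production [2] translates almost literally into a range check. The sketch below assumes the input has already been decoded into a Unicode code point. The function name is_xml_char is made up for this example and is not part of Axis2/C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the code point matches production [2] Char of XML 1.0. */
bool is_xml_char(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

Note that the apostrophe (#x27) is an ordinary legal character here. The production only says which characters may appear in a document at all, not which ones have to be escaped.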

The mechanism for encoding character code points into bit patterns may vary from entity to
entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode [Unicode]
<http://www.w3.org/TR/2008/REC-xml-20081126/#Unicode>; the mechanisms for signaling
which of the two is in use, or for bringing other encodings into play, are discussed later,
in 4.3.3 Character Encoding in Entities <http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding>.

Note:

Document authors are encouraged to avoid "compatibility characters", as defined in section
2.3 of [Unicode] <http://www.w3.org/TR/2008/REC-xml-20081126/#Unicode>. The characters
defined in the following ranges are also discouraged. They are either control characters or
permanently undefined Unicode characters:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],

[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],

[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],

[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],

[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],

[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],

[#x10FFFE-#x10FFFF].
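
The discouraged ranges follow a regular pattern: two blocks of control characters, the block #xFDD0 to #xFDEF, and the last two code points of each of the planes 1 to 16. As a sketch, assuming decoded code points again and using the hypothetical name is_discouraged_xml_char, the list can be checked like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the code point is in one of the ranges listed above. */
bool is_discouraged_xml_char(uint32_t cp)
{
    /* Control characters; #x85 is not part of the list. */
    if ((cp >= 0x7F && cp <= 0x84) || (cp >= 0x86 && cp <= 0x9F))
        return true;
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;
    /* xFFFE and xFFFF at the end of planes 1 to 16. */
    return cp >= 0x1FFFE && cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
}
```

These characters are still legal according to production [2], so a check like this would typically produce a warning rather than an error.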

 

Etc.

 

2.4 Character Data and Markup

Text <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-text> consists of intermingled
character data <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-chardata> and markup.
[Definition: Markup takes the form of start-tags <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-stag>,
end-tags <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-etag>, empty-element tags
<http://www.w3.org/TR/2008/REC-xml-20081126/#dt-empty>, entity references <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-entref>,
character references <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-charref>, comments
<http://www.w3.org/TR/2008/REC-xml-20081126/#dt-comment>, CDATA section <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-cdsection>
delimiters, document type declarations <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-doctype>,
processing instructions <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-pi>, XML
declarations <http://www.w3.org/TR/2008/REC-xml-20081126/#NT-XMLDecl>, text declarations
<http://www.w3.org/TR/2008/REC-xml-20081126/#NT-TextDecl>, and any white space that
is at the top level of the document entity (that is, outside the document element and not
inside any other markup).]

[Definition: All text that is not markup constitutes the character data of the document.]


The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their
literal form, except when used as markup delimiters, or within a comment <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-comment>,
a processing instruction <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-pi>, or
a CDATA section <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-cdsection>. If they
are needed elsewhere, they MUST be escaped <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-escape>
using either numeric character references <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-charref>
or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>)
may be represented using the string "&gt;", and MUST, for compatibility <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-compat>,
be escaped using either "&gt;" or a character reference when it appears in the string
"]]>" in content, when that string is not marking the end of a CDATA section <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-cdsection>.

In the content of elements, character data is any string of characters which does not contain
the start-delimiter of any markup and does not include the CDATA-section-close delimiter,
"]]>". In a CDATA section, character data is any string of characters not including the
CDATA-section-close delimiter, "]]>".

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote
character (') may be represented as "&apos;", and the double-quote character (") as
"&quot;".

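An escaping routine that covers all five characters named in this section, including the apostrophe, could look like the following. This is a minimal sketch and not the implementation of axutil_xml_quote_string. The name xml_escape is hypothetical, and the function uses plain malloc rather than the Axis2/C environment allocator.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns a newly allocated copy of `in` in which
 * & < > " ' are replaced by &amp; &lt; &gt; &quot; &apos;.
 * The caller frees the result. Returns NULL if allocation fails.
 */
char *xml_escape(const char *in)
{
    /* Worst case: every character becomes a 6-character reference. */
    char *out = malloc(strlen(in) * 6 + 1);
    char *w = out;

    if (out == NULL)
        return NULL;

    for (; *in != '\0'; ++in) {
        const char *rep = NULL;

        switch (*in) {
        case '&':  rep = "&amp;";  break;
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '"':  rep = "&quot;"; break;
        case '\'': rep = "&apos;"; break;
        default:   break;
        }
        if (rep != NULL) {
            size_t n = strlen(rep);
            memcpy(w, rep, n);      /* copy the entity reference */
            w += n;
        } else {
            *w++ = *in;             /* copy the character unchanged */
        }
    }
    *w = '\0';
    return out;
}
```

Escaping the apostrophe is only needed inside attribute values that are themselves delimited by apostrophes, but escaping it everywhere is harmless and keeps the routine independent of where its output ends up.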
Character Data

[14]   CharData   ::=   [^<&]* - ([^<&]* ']]>' [^<&]*)
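
Read as code, production [14] says that character data is any string that contains neither "<" nor "&" and does not contain the sequence "]]>". A small sketch of that test, with the hypothetical name is_char_data:

```c
#include <stdbool.h>
#include <string.h>

/* Returns true if the string matches production [14] CharData. */
bool is_char_data(const char *s)
{
    return strpbrk(s, "<&") == NULL && strstr(s, "]]>") == NULL;
}
```

An already escaped text such as "40 is &lt; 70" fails this test because of the "&". That is expected: the reference "&lt;" counts as markup, and element content is a mixture of CharData pieces and references.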

 

 

Hope that explains a bit,

and always consider which encoding is in use when the first line of the XML stream specifies
one, as in:   http://www.w3schools.com/xml/singlebyte2.xml
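
To show what "considering the encoding" can mean in code, here is a rough sketch that reads the encoding name from an XML declaration such as <?xml version="1.0" encoding="ISO-8859-1"?>. The function name xml_declared_encoding is hypothetical. The sketch assumes the declaration is readable as ASCII-compatible bytes and ignores byte order marks and UTF-16 input.

```c
#include <string.h>

/*
 * Hypothetical helper: copies the value of the encoding pseudo-attribute
 * of an XML declaration into buf. Returns 1 on success, 0 if the line is
 * not an XML declaration, declares no encoding, or buf is too small.
 */
int xml_declared_encoding(const char *line, char *buf, size_t buflen)
{
    const char *p, *end;
    char quote;
    size_t n;

    /* Must start with "<?xml" followed by white space. */
    if (strncmp(line, "<?xml", 5) != 0 || strspn(line + 5, " \t\r\n") == 0)
        return 0;
    p = strstr(line, "encoding");
    if (p == NULL)
        return 0;
    p += strlen("encoding");
    p += strspn(p, " \t\r\n");      /* optional white space before '=' */
    if (*p != '=')
        return 0;
    ++p;
    p += strspn(p, " \t\r\n");      /* optional white space after '=' */
    quote = *p;
    if (quote != '"' && quote != '\'')
        return 0;
    ++p;
    end = strchr(p, quote);         /* matching closing quote */
    if (end == NULL)
        return 0;
    n = (size_t)(end - p);
    if (n == 0 || n + 1 > buflen)
        return 0;
    memcpy(buf, p, n);
    buf[n] = '\0';
    return 1;
}
```

If no encoding is declared and nothing else signals one, the XML specification expects the document to be in UTF-8 or UTF-16, so a return value of 0 does not mean that the document is invalid.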

 

Josef

From: scarleton@gmail.com [mailto:scarleton@gmail.com] On Behalf Of Sam Carleton
Sent: Sunday, 20 February 2011 19:22
To: Apache AXIS C User List
Subject: axutil_xml_quote_string and apostrophes

 

I just discovered that the axutil_xml_quote_string only escapes the less than, greater than,
and quote, but not the apostrophe.  Is there a reason for this or is it a bug? 

If it is a bug, I would be happy to fix it and submit it back if someone would enlighten me
as to how to do that.
