axis-c-user mailing list archives

From Sam Carleton <scarle...@miltonstreet.com>
Subject Re: axutil_xml_quote_string and apostrophes
Date Wed, 23 Feb 2011 16:23:31 GMT
Josef,

Your reply gives me the impression that you expect me to become an expert in
XML.  What drives me nuts about open source is that the community seems to
expect everyone to be an expert in everything.  I simply don't have time to
learn every last detail of every little tool I use when my goal is
developing applications, not building tools.

I guess I will run with what I got and hope new issues don't come up.  So
far all seems to work well.

Sam

On Tue, Feb 22, 2011 at 4:27 AM, Stadelmann Josef <
josef.stadelmann@axa-winterthur.ch> wrote:

> Yes, there is in fact a reason for that. The restriction comes from the
> XML standard, http://www.w3.org/TR/2008/REC-xml-20081126/#NT-Name, which
> explains the usage of certain characters.
>
> The less-than sign (<) opens a tag in the stream, the greater-than sign (>)
> closes it, and double quotes delimit attribute values.
>
>
>
> In short, and maybe oversimplified:
>
> Suppose the parser has to read
>
> <statement>40 is < than 70</statement>
>
>
>
> The parser has a problem. After the opening tag <statement> is read, the
> parser treats everything up to the next "<" as character data. When it
> finds a "<", it expects markup to start there, in this case the "/" of the
> closing tag.
>
> It finds the "<" after "is", but the next character is a space rather than
> a "/" or a tag name, so it cannot continue and has to report an error.
>
>
>
> As a consequence, "<" can't be used as data between the opening and
> closing tags. It can, however, be transmitted by using an escaping technique.
>
>
>
> I suggest you read "just a bit" of
> http://www.w3.org/TR/2008/REC-xml-20081126/#NT-Name to get a better
> understanding of why certain characters can be used in text and others
> cannot. There are, however, ways to send '<', '>' or '"' in an XML stream:
> in these cases an escaping technique is used.
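
As a rough illustration of the escaping technique mentioned above, here is a minimal sketch in C. The helper name `escape_xml` is hypothetical: it is not an Axis2/C function and makes no claim about how axutil_xml_quote_string works internally. It simply replaces the five characters that have predefined entities in XML 1.0.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Axis2/C): returns a newly allocated
 * copy of `in` with &, <, >, " and ' replaced by their predefined XML
 * entities. The caller frees the result. Returns NULL if malloc fails. */
static char *escape_xml(const char *in)
{
    char *out;
    char *p;

    assert(in != NULL);

    /* Worst case: every byte expands to a 6-byte entity such as &quot; */
    out = malloc(strlen(in) * 6 + 1);
    if (out == NULL)
        return NULL;

    p = out;
    for (; *in != '\0'; ++in) {
        const char *rep = NULL;

        switch (*in) {
        case '&':  rep = "&amp;";  break;
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '"':  rep = "&quot;"; break;
        case '\'': rep = "&apos;"; break;
        default:   break;
        }

        if (rep != NULL) {
            size_t n = strlen(rep);
            memcpy(p, rep, n);   /* copy the entity text */
            p += n;
        } else {
            *p++ = *in;          /* ordinary byte, copied unchanged */
        }
    }
    *p = '\0';
    return out;
}
```

Because only ASCII bytes are inspected, the sketch leaves multi-byte UTF-8 sequences untouched; it assumes the input is not already escaped, so running it twice would double-escape the ampersands.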
>
>
>
> Also read http://en.wikipedia.org/wiki/Character_encoding, because you
> should never forget which encoding is in use when your own code parses or
> writes XML documents.
>
>
>
>
>
> *Excerpt from the document linked above (see the bold text below first)*
>
> *2.2 Characters*
>
> [Definition: A parsed entity contains *text*, a sequence of characters
> <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-character>, which may
> represent markup or character data.] [Definition: A *character* is an
> atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646]
> <http://www.w3.org/TR/2008/REC-xml-20081126/#ISO10646>. Legal
> characters are tab, carriage return, line feed, and the legal characters of
> Unicode and ISO/IEC 10646. The versions of these standards cited in *A.1
> Normative References*
> <http://www.w3.org/TR/2008/REC-xml-20081126/#sec-existing-stds> were
> current at the time this document was prepared. New characters may be
> added to these standards by amendments or new editions. Consequently, XML
> processors MUST accept any character in the range specified for Char
> <http://www.w3.org/TR/2008/REC-xml-20081126/#NT-Char>.]
>
> *Character Range*
>
> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>     /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
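
The Char production above translates almost directly into a range check. The following is a small sketch in C; the function name `is_xml_char` is made up for illustration, and it assumes the input has already been decoded from UTF-8 or UTF-16 into a Unicode code point.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the code point `cp` matches production [2]
 * (Char) of XML 1.0, i.e. tab, line feed, carriage return, or any code
 * point outside the surrogate blocks other than FFFE and FFFF. */
static bool is_xml_char(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20    && cp <= 0xD7FF)  ||
           (cp >= 0xE000  && cp <= 0xFFFD)  ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

Note that this only tests what is legal; the "discouraged" ranges listed in the note of section 2.2 still pass the check.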
>
> The mechanism for encoding character code points into bit patterns may vary
> from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16
> encodings of Unicode [Unicode]
> <http://www.w3.org/TR/2008/REC-xml-20081126/#Unicode>; the mechanisms for
> signaling which of the two is in use, or for bringing other encodings into
> play, are discussed later, in *4.3.3 Character Encoding in Entities*
> <http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding>.
>
> *Note:*
>
> Document authors are encouraged to avoid "compatibility characters", as
> defined in section 2.3 of [Unicode]<http://www.w3.org/TR/2008/REC-xml-20081126/#Unicode>.
> The characters defined in the following ranges are also discouraged. They
> are either control characters or permanently undefined Unicode characters:
>
> [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
>
> [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
>
> [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
>
> [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
>
> [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
>
> [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
>
> [#x10FFFE-#x10FFFF].
>
>
>
> Etc.
>
>
>
> *2.4 Character Data and Markup*
>
> Text <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-text> consists of
> intermingled character data
> <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-chardata> and markup.
> [Definition: *Markup* takes the form of start-tags
> <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-stag>, end-tags
> <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-etag>, empty-element tags
> <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-empty>, entity references
> <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-entref>, character
> references <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-charref>,
> comments <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-comment>, CDATA
> section <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-cdsection>
> delimiters, document type declarations
> <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-doctype>, processing
> instructions <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-pi>, XML
> declarations <http://www.w3.org/TR/2008/REC-xml-20081126/#NT-XMLDecl>,
> text declarations
> <http://www.w3.org/TR/2008/REC-xml-20081126/#NT-TextDecl>, and any white
> space that is at the top level of the document entity (that is, outside
> the document element and not inside any other markup).]
>
> [Definition: All text that is not markup constitutes the *character data*
> of the document.]
>
> *The ampersand character (&) and the left angle bracket (<) MUST NOT
> appear in their literal form, except when used as markup delimiters, or
> within a comment <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-comment>,
> a processing instruction
> <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-pi>, or a CDATA section
> <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-cdsection>.* *If they are
> needed elsewhere, they MUST be escaped
> <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-escape> using either
> numeric character references
> <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-charref> or the strings
> " &amp; " and " &lt; " respectively.* The right angle bracket (>) may be
> represented using the string " &gt; ", and MUST, for compatibility
> <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-compat>, be escaped using
> either " &gt; " or a character reference when it appears in the string
> " ]]> " in content, when that string is not marking the end of a CDATA
> section <http://www.w3.org/TR/2008/REC-xml-20081126/#dt-cdsection>.
>
> In the content of elements, character data is any string of characters
> which does not contain the start-delimiter of any markup and does not
> include the CDATA-section-close delimiter, " ]]> ". In a CDATA section,
> character data is any string of characters not including the
> CDATA-section-close delimiter, " ]]> ".
>
> To allow attribute values to contain both single and double quotes, the
> apostrophe or single-quote character (') may be represented as " &apos; ",
> and the double-quote character (") as " &quot; ".
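
The paragraph above implies that an apostrophe only needs escaping when the attribute value itself is delimited by apostrophes. As a sketch of that point, the C function below (the name `format_attr` and its signature are invented for this example, not taken from Axis2/C) always delimits the value with double quotes, so it escapes `&`, `<` and `"` but can leave `'` and `>` as they are.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes name="value" into buf (capacity cap bytes),
 * escaping only what a double-quoted attribute value requires.
 * Returns 0 on success, -1 if the result does not fit. */
static int format_attr(char *buf, size_t cap, const char *name,
                       const char *value)
{
    size_t used;
    int n = snprintf(buf, cap, "%s=\"", name);

    if (n < 0 || (size_t)n >= cap)
        return -1;
    used = (size_t)n;

    for (; *value != '\0'; ++value) {
        char single[2] = { *value, '\0' };
        const char *rep;
        size_t len;

        switch (*value) {
        case '&': rep = "&amp;";  break;
        case '<': rep = "&lt;";   break;
        case '"': rep = "&quot;"; break;
        default:  rep = single;   break;  /* includes ' and > */
        }

        len = strlen(rep);
        /* Keep room for the closing quote and the terminating NUL. */
        if (used + len + 2 > cap)
            return -1;
        memcpy(buf + used, rep, len);
        used += len;
    }

    if (used + 2 > cap)
        return -1;
    buf[used++] = '"';
    buf[used] = '\0';
    return 0;
}
```

Escaping the apostrophe as well, as `&apos;`, would also be valid XML; leaving it literal is just one possible choice when the delimiter is known to be a double quote.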
>
> *Character Data*
>
> [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
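
Production [14] reads as "any run of characters containing neither `<` nor `&`, minus any run that contains `]]>`". A minimal C sketch of that test follows; `is_valid_chardata` is a hypothetical name, and the check works on a NUL-terminated byte string without validating the character encoding.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `s` could appear unescaped as character
 * data in element content, following production [14] (CharData). */
static bool is_valid_chardata(const char *s)
{
    return strpbrk(s, "<&") == NULL   /* no markup start-delimiters */
        && strstr(s, "]]>") == NULL;  /* no CDATA-section-close delimiter */
}
```

A string such as `40 is &lt; 70` fails this test because of the ampersand: in the grammar an entity reference is markup, not character data, which is why the check is useful on text before escaping rather than after.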
>
>
>
>
>
> Hope that explains a bit,
>
> and always consider which encoding is declared in the first line of the
> XML stream, as in: http://www.w3schools.com/xml/singlebyte2.xml
>
>
>
> Josef
>
> *From:* scarleton@gmail.com [mailto:scarleton@gmail.com] *On Behalf Of *Sam
> Carleton
> *Sent:* Sunday, 20 February 2011 19:22
> *To:* Apache AXIS C User List
> *Subject:* axutil_xml_quote_string and apostrophes
>
>
>
> I just discovered that the axutil_xml_quote_string only escapes the less
> than, greater than, and quote, but not the apostrophe.  Is there a reason
> for this or is it a bug?
>
> If it is a bug, I would be happy to fix it and submit it back if someone
> would enlighten me as to how to do that.
>
