commons-dev mailing list archives

From Gary Gregory <garydgreg...@gmail.com>
Subject Re: LANG-728 to work with Lang 3.0 way of using escapeXml with > 0x7f characters [WAS RE: svn commit: r1148162 - /commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java]
Date Tue, 19 Jul 2011 21:18:25 GMT
Hi Hen,

I have more questions than answers...

On Tue, Jul 19, 2011 at 12:35 PM, Henri Yandell <flamefew@gmail.com> wrote:
>
> So you're not saying that we have to escape > 0x7f (old behaviour),

Yeah, the way I read the W3C site, I thought we'd need to escape code
points > 0xFFFF (65,535), that is, those above the BMP.

>
> but that we have to escape any supplementary characters?

Yes, in particular, an escaped code point > 0xFFFF must be written
as a single escape (&#x233B4; rather than &#xD84C;&#xDFB4;)

The way I read the site is that IF you are going to escape a code point
> 0xFFFF, then you MUST use a single code point value.
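
A minimal sketch of that single-escape rule, using only the JDK rather than the Lang translators. The class and method names are made up for illustration, and the 0x7f threshold is an assumption borrowed from the test quoted further down:

```java
import java.util.Locale;

class SingleEscapeSketch {

    /** Writes every code point above 0x7f as ONE hexadecimal numeric character reference. */
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            // codePointAt joins a valid high/low surrogate pair into one code point
            int cp = input.codePointAt(i);
            if (cp > 0x7f) {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(';');
            } else {
                out.appendCodePoint(cp);
            }
            // advance by 2 chars for a supplementary code point, 1 otherwise
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The surrogate pair D84C DFB4 encodes U+233B4
        System.out.println(escapeNonAscii("\uD84C\uDFB4")); // prints &#x233B4;
    }
}
```

Iterating by code point instead of by char is what keeps the pair from being emitted as two escapes.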

What is not clear to me yet is if/when you must escape code points > 0xFFFF.

The XML 1.0 spec reads:

[2]   Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
      /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

So does that mean that we should make sure we do NOT escape an XML
Char (aside from &, >, < and so on)?
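
To make the question concrete, here is the XML 1.0 Char production quoted above written as a predicate on a code point. This is only a sketch: the class and method names are hypothetical, and it says nothing about whether Lang should or should not escape such characters.

```java
class Xml10CharSketch {

    /** True if cp matches production [2] Char of XML 1.0 as quoted above. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXml10Char(0x233B4)); // supplementary code point from LANG-728
        System.out.println(isXml10Char(0xD84C));  // a lone high surrogate
    }
}
```

By this production U+233B4 is itself a legal Char, while the individual surrogates D84C and DFB4 are not.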

Then what about XML 1.1?

The XML 1.1 spec reads:

[2]   Char   ::=   [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
      /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a]   RestrictedChar   ::=   [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
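
The two XML 1.1 productions quoted above, written as predicates on a code point. The names are hypothetical, and this only encodes the ranges; what escaping policy should follow from RestrictedChar is exactly the open question.

```java
class Xml11CharSketch {

    /** True if cp matches production [2] Char of XML 1.1 as quoted above. */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if cp matches production [2a] RestrictedChar of XML 1.1 as quoted above. */
    static boolean isXml11RestrictedChar(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
            || (cp >= 0xB && cp <= 0xC)
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }

    public static void main(String[] args) {
        System.out.println(isXml11Char(0x1));           // a Char in 1.1, unlike in 1.0
        System.out.println(isXml11RestrictedChar(0x1)); // but also a RestrictedChar
        System.out.println(isXml11RestrictedChar(0x85)); // NEL falls in the gap between the ranges
    }
}
```

Note that 0x7F, the lower bound used in the test, is the first code point of the [#x7F-#x84] RestrictedChar range.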

The more I look at this, the more confusing it gets!

Gary
>
> Hen
>
> On Tue, Jul 19, 2011 at 7:28 AM, Gary Gregory
> <GGregory@seagullsoftware.com> wrote:
> > Hi All:
> >
> > I am glad to know there is a 3.0 way of doing that, which is:
> >
> >    @Test
> >    public void testEscapeXmlSupplementaryCharacters() {
> >        CharSequenceTranslator escapeXml =
> >            StringEscapeUtils.ESCAPE_XML.with( NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE) );
> >
> >        assertEquals("Supplementary character must be represented using a single escape", "&#144308;",
> >                escapeXml.translate("\uD84C\uDFB4"));
> >
> >  but what about the test the way it was originally written?
> >
> >        // Example from https://issues.apache.org/jira/browse/LANG-728
> >        assertEquals("Supplementary character must be represented using a single escape", "&#144308;",
> >                StringEscapeUtils.escapeXml("\uD84C\uDFB4"));
> >        // Example from See http://www.w3.org/International/questions/qa-escapes
> >        assertEquals("Supplementary character must be represented using a single escape", "&#x233B4;",
> >                StringEscapeUtils.escapeXml("\uD84C;\uDFB4;"));
> >
> > It still fails.
> >
> > Shouldn't the API be changed to work for this case too? The W3C seems to say so:
> > "you must use the single, code point value for that character" in:
> >
> >     * From http://www.w3.org/International/questions/qa-escapes
> >     * </p>
> >     * <blockquote>
> >     * Supplementary characters are those Unicode characters that have code points higher than the characters in
> >     * the Basic Multilingual Plane (BMP). In UTF-16 a supplementary character is encoded using two 16-bit surrogate code points from the
> >     * BMP. Because of this, some people think that supplementary characters need to be represented using two escapes, but this is incorrect
> >     * – you must use the single, code point value for that character. For example, use &#x233B4; rather than &#xD84C;&#xDFB4;.
> >     * </blockquote>
> >
> > Gary
> >
> > -----Original Message-----
> > From: bayard@apache.org [mailto:bayard@apache.org]
> > Sent: Tuesday, July 19, 2011 0:58 AM
> > To: commits@commons.apache.org
> > Subject: svn commit: r1148162 - /commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java
> >
> > Author: bayard
> > Date: Tue Jul 19 04:58:03 2011
> > New Revision: 1148162
> >
> > URL: http://svn.apache.org/viewvc?rev=1148162&view=rev
> > Log:
> > Updating unit test for LANG-728 to work with Lang 3.0 way of using escapeXml with > 0x7f characters
> >
> > Modified:
> >    commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java
> >
> > Modified: commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java
> > URL: http://svn.apache.org/viewvc/commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java?rev=1148162&r1=1148161&r2=1148162&view=diff
> > ==============================================================================
> > --- commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java (original)
> > +++ commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java Tue Jul 19 04:58:03 2011
> > @@ -31,6 +31,9 @@ import org.apache.commons.io.IOUtils;
> >  import org.junit.Ignore;
> >  import org.junit.Test;
> >
> > +import org.apache.commons.lang3.text.translate.CharSequenceTranslator;
> > +import org.apache.commons.lang3.text.translate.UnicodeEscaper;
> > +
> >  /**
> >  * Unit tests for {@link StringEscapeUtils}.
> >  *
> > @@ -333,15 +336,13 @@ public class StringEscapeUtilsTest {
> >      * @see <a href="http://www.w3.org/International/questions/qa-escapes">Using character escapes in markup and CSS</a>
> >      * @see <a href="https://issues.apache.org/jira/browse/LANG-728">LANG-728</a>
> >      */
> > -    @Ignore
> >     @Test
> >     public void testEscapeXmlSupplementaryCharacters() {
> > -        // Example from https://issues.apache.org/jira/browse/LANG-728
> > -        assertEquals("Supplementary character must be represented using a single escape", "&#144308;",
> > -                StringEscapeUtils.escapeXml("\uD84C\uDFB4"));
> > -        // Example from See http://www.w3.org/International/questions/qa-escapes
> > -        assertEquals("Supplementary character must be represented using a single escape", "&#x233B4;",
> > -                StringEscapeUtils.escapeXml("\uD84C;\uDFB4;"));
> > +        CharSequenceTranslator escapeXml =
> > +            StringEscapeUtils.ESCAPE_XML.with( UnicodeEscaper.between(0x7f, Integer.MAX_VALUE) );
> > +
> > +        assertEquals("Supplementary character must be represented using a single escape", "\u233B4",
> > +                escapeXml.translate("\uD84C\uDFB4"));
> >     }
> >
> >     // Tests issue #38569
> >
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>



--
Thank you,
Gary

http://garygregory.wordpress.com/
http://garygregory.com/
http://people.apache.org/~ggregory/
http://twitter.com/GaryGregory


