commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Henri Yandell <>
Subject [LANG] Rewrite StringEscapeUtils [Was: [jira] Commented: (LANG-505) Rewrite StringEscapeUtils]
Date Mon, 18 May 2009 06:41:33 GMT
So this is more noticeable. I'm starting to play with a rewritten
StringEscapeUtils system. escapeSql is going to get deleted, the
others will sit on top of a modular system. At least that's the
proposal - more in the ticket.

First pass at the code was educational - second pass should be
discussable but thoughts welcome now if anyone has the urge.


On Sun, May 17, 2009 at 11:36 PM, Henri Yandell (JIRA) <> wrote:
>    [
> Henri Yandell commented on LANG-505:
> ------------------------------------
> A generic solution that's pluggable gets a bit interesting - do you assume that a single
unicode codepoint maps to 1..n unicode codepoints, or should you be able to escape multiple
codepoints? Unescaping is n codepoints to 1 codepoint. This makes it harder to have N plugins
all working together. It becomes a translator, unescaping and escaping the same algorithm.
> My first pass was on the 1..n. API is:
> {code:java}
>    // Note it's a FilterWriter, so 'out' is part of the class. Probably change that
and put Writer in the API.
>    public abstract boolean escape(int codepoint) throws IOException;
> {code}
> The boolean is returned to say whether that codepoint was successfully translated or
not. This is needed for AggregateEscapers to work and so codepoint skipping can occur.
> It leads to a top level use of:
> {code:java}
>    public void escapeJava(String input, Writer out) throws IOException {
>        AggregateEscaper escapers = new AggregateEscaper(
>            new EscapeBasedOnLookup(out,
>                      new String[][] {
>                            {"\"", "\\\""},
>                            {"\\", "\\\\"}
>                      }),
>            new EscapeLowAsciiAsUnicode(out),
>            new EscapeNonAsciiAsUnicode(out)
>        );
>        escape(input, escapers);
>    }
> {code}
> It's easy to see from there how a user might modify what they view this to be. They could
write an escapeC quite easily etc. However - it's 1..n, and unescaping isn't handled. The
1..n escape algorithm looks like:
> {code:java}
>    public void escape(String str, CharEscaper escaper) throws IOException {
>        if (str == null) {
>            return;
>        }
>        if (escaper == null) {
>            throw new IllegalArgumentException("The CharEscaper must not be null.
" +
>                                               "Use NullEscaper
if you expected this to mean a no-operation");
>        }
>        int sz = str.length();
>        for (int i = 0; i < sz; i++) {
>            int c = Character.codePointAt(str, i);
>            boolean success = escaper.escape(c);
>            // contract with escapers is that they have to understand codepoints
and they just took care of a surrogate pair
>            if(success && c >= 0x010000 && i < sz - 1) {
>                i++;
>            }
>        }
>    }
> {code}
> As I said - 1..n is the problem. Parsing needs to happen a character at a time, and each
Escaper needs to be offered the choice to take control of the flow. Some options - thinking
out loud:
> * Escapers need to be passed in the equivalent of a C pointer String. A CharSequence
would be good, but calling the subSequence method all the time might not be performant. Probably
best to pass down the index and the CharSequence down. Making the whole thing String rather
than char based - probably better anyway as that saves having to think in terms of codepoints.
> * Escapers are asked whether they want to make a change first, then if they return true
they are called again to make the change and they return an index increment for the driving
loop to make based on the number of characters they consumed.
> So:
> {code:java}
>    public abstract boolean isEscapable(int index, CharSequence input);
>    public abstract int escape(int index, CharSequence input, Writer out) throws IOException;
> {code}
> Needs another code pass. Anyway - I think our current code is screwed up enough to warrant
a deeper implementation.
>> Rewrite StringEscapeUtils
>> -------------------------
>>                 Key: LANG-505
>>                 URL:
>>             Project: Commons Lang
>>          Issue Type: Task
>>            Reporter: Henri Yandell
>>             Fix For: 3.0
>> I think StringEscapeUtils needs a strong rewrite. For each escape method (and unescape)
there tend to be three or four types of escaping happening. So not being able to define which
set of three or four apply is a pain point (and cause of bug reports due to different desired
>> We should be offering basic functionality, but also allowing people to say "escape(Escapers.BASIC_XML,
Escapers.LOW_UNICODE, Escapers.HIGH_UNICODE)".
>> Also should delete escapeSql; it's a bad one imo. Dangerous in that it will lead
people to not use PreparedStatement and given it only escapes ', it's not much use. Especially
as different dialects escape that in different ways.
>> Opening this ticket for discussion.
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message