commons-dev mailing list archives

From Peter Wall <pw...@pwall.net>
Subject Re: [lang] Suggested alternatives for escape functions
Date Wed, 11 Dec 2013 00:30:34 GMT
Hi Bernd,

Thank you for taking the time to look at my submission.  Let me see if 
I can answer your comments:

1.  I have a separate version (which I did not include in my original 
email; I thought it was already long enough) which handles UTF-16 
strings, that is, strings which could include Unicode surrogate 
sequences:

     public static final String escapeUTF16(String s, CharMapper mapper) {
         char ch1 = '\0', ch2 = '\0'; // avoid "possibly uninitialised" errors
         for (int i = 0, n = s.length(); i < n; ) {
             int k = i;
             ch1 = s.charAt(i++);
             String mapped;
             if (Character.isHighSurrogate(ch1)) {
                 if (i >= n || !Character.isLowSurrogate(ch2 = s.charAt(i++)))
                     throw new IllegalArgumentException("Illegal surrogate sequence");
                 mapped = mapper.map(Character.toCodePoint(ch1, ch2));
             }
             else
                 mapped = mapper.map(ch1);
             if (mapped != null) {
                 StringBuilder sb = new StringBuilder();
                 for (int j = 0; j < k; ++j)
                     sb.append(s.charAt(j));
                 sb.append(mapped);
                 while (i < n) {
                     ch1 = s.charAt(i++);
                     if (Character.isHighSurrogate(ch1)) {
                         if (i >= n || !Character.isLowSurrogate(ch2 = s.charAt(i++)))
                             throw new IllegalArgumentException("Illegal surrogate sequence");
                         mapped = mapper.map(Character.toCodePoint(ch1, ch2));
                     }
                     else
                         mapped = mapper.map(ch1);
                     if (mapped != null)
                         sb.append(mapped);
                     else if (Character.isHighSurrogate(ch1))
                         sb.append(ch1).append(ch2);
                     else
                         sb.append(ch1);
                 }
                 return sb.toString();
             }
         }
         return s;
     }

As you can see, this uses the same CharMapper, and in this case it is 
called with a full Unicode code point.  Whether to throw an exception or 
simply to process the characters anyway in the case of an erroneous 
surrogate sequence is a matter of debate; I have chosen the former in 
this case but I could be persuaded otherwise.
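
A minimal sketch of the lenient alternative, for comparison.  It assumes the same CharMapper contract as above (null means "leave unchanged"); the class name SurrogatePolicySketch and the method name escapeLenient are made up for illustration and are not part of the submitted code.  It relies on String.codePointAt(), which returns an unpaired surrogate as its own code unit value, so a malformed sequence reaches the mapper instead of raising an exception:

```java
final class SurrogatePolicySketch {

    /** Same shape as the CharMapper in this thread: null means "leave as is". */
    interface CharMapper {
        String map(int codePoint);
    }

    /**
     * Lenient variant (hypothetical): an unpaired surrogate is handed to the
     * mapper as a lone code unit instead of causing an exception.
     */
    static String escapeLenient(String s, CharMapper mapper) {
        StringBuilder sb = null; // allocated only once a mapping is found
        for (int i = 0, n = s.length(); i < n; ) {
            // codePointAt returns the surrogate itself when it is unpaired
            int cp = s.codePointAt(i);
            int len = Character.charCount(cp);
            String mapped = mapper.map(cp);
            if (mapped != null) {
                if (sb == null)
                    sb = new StringBuilder(n + 16).append(s, 0, i); // copy the untouched prefix
                sb.append(mapped);
            }
            else if (sb != null)
                sb.append(s, i, i + len); // copy one or two code units unchanged
            i += len;
        }
        return sb == null ? s : sb.toString();
    }
}
```

With a mapper that converts everything from 0x7F upwards to a numeric character reference, "a\uD83D\uDE00b" becomes "a&#128512;b", while a trailing lone "\uD83D" becomes "&#55357;".  The latter is not a legal XML character reference, so under the lenient policy the mapper itself would still have to decide what to do with surrogate values.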

2.  In different iterations of this code I have attempted to estimate 
the output length and pre-allocate the StringBuilder, but estimates are 
difficult.  My most recent attempt used double the input string length, 
but for a 2-character string, where both characters convert to 
8-character sequences, this would be worse than the StringBuilder 
default (of 16).  Perhaps double the input string length plus 20 would 
be a good estimate.  I'm happy to take suggestions on this point.
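
To make the "double plus 20" idea concrete, here is a rough sketch of the char-based escape with a pre-sized StringBuilder.  The names PreallocSketch and initialCapacity are hypothetical, and the formula is only the estimate floated above, not a measured optimum.  The sketch also copies the unmodified prefix with a single append(s, 0, i) call rather than character by character:

```java
final class PreallocSketch {

    /** Same shape as the CharMapper in this thread: null means "leave as is". */
    interface CharMapper {
        String map(int codePoint);
    }

    /** Estimate from the discussion: double the input length plus 20 (overflow-safe). */
    static int initialCapacity(int inputLength) {
        return (int) Math.min(Integer.MAX_VALUE - 8, 2L * inputLength + 20);
    }

    static String escape(String s, CharMapper mapper) {
        for (int i = 0, n = s.length(); i < n; i++) {
            String mapped = mapper.map(s.charAt(i));
            if (mapped == null)
                continue; // nothing to escape yet, so nothing allocated yet
            StringBuilder sb = new StringBuilder(initialCapacity(n));
            sb.append(s, 0, i).append(mapped); // bulk-copy the untouched prefix
            for (int j = i + 1; j < n; j++) {
                char ch = s.charAt(j);
                mapped = mapper.map(ch);
                if (mapped != null)
                    sb.append(mapped);
                else
                    sb.append(ch);
            }
            return sb.toString();
        }
        return s; // no character needed escaping
    }
}
```

For the 2-character case mentioned above, initialCapacity(2) is 24, which covers two 8-character replacements without a resize.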

3.  I have a separate version of escape (and escapeUTF16) which takes a 
CharSequence and returns a CharSequence as output (in line with my 
principle of returning the input object unmodified if it needs no 
conversion).  The code is identical except that 'return sb.toString();' 
becomes 'return sb;'.  I realise that calling toString() on a String 
would return 'this' so there would be no unnecessary object allocation 
if I were to take a CharSequence as input and return a String.  Again, I 
am happy to take suggestions.
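
The two options can be sketched side by side.  This is an illustration only: the class name CharSequenceSketch and the method name escapeToString are invented here, and the body is the simple char-based escape rather than the UTF-16 one.

```java
final class CharSequenceSketch {

    /** Same shape as the CharMapper in this thread: null means "leave as is". */
    interface CharMapper {
        String map(int codePoint);
    }

    /** CharSequence in, CharSequence out: the input object itself if nothing changes. */
    static CharSequence escape(CharSequence s, CharMapper mapper) {
        for (int i = 0, n = s.length(); i < n; i++) {
            String mapped = mapper.map(s.charAt(i));
            if (mapped == null)
                continue;
            StringBuilder sb = new StringBuilder();
            sb.append(s, 0, i).append(mapped);
            for (int j = i + 1; j < n; j++) {
                char ch = s.charAt(j);
                mapped = mapper.map(ch);
                if (mapped != null)
                    sb.append(mapped);
                else
                    sb.append(ch);
            }
            return sb; // 'return sb;' instead of 'return sb.toString();'
        }
        return s;
    }

    /** CharSequence in, String out: String.toString() returns 'this', so no copy for String input. */
    static String escapeToString(CharSequence s, CharMapper mapper) {
        return escape(s, mapper).toString();
    }
}
```

One trade-off of the first form is that the caller receives a mutable StringBuilder typed as CharSequence; the second form always yields an immutable String, at the cost of one copy when the input is not already a String.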

Regards,
Peter


On 2013-12-11 03:39, Bernd Eckenfels wrote:
> Hello,
>
> it depends on what you want to escape: a single Unicode character
> could be 2 chars (a single UTF-16 code unit can only cover the BMP).
> So having a String-typed needle can be helpful. But of course all the
> usual things are single-codepoint characters (<>&"...). Having said
> that, any reason why CharMapper takes an integer, not a char? That's
> misleading in this context if someone expects it to be a real
> codepoint - which it is not (using charAt()).
>
> Besides that, the implementation copies single characters to the new
> StringBuilder and produces multiple string buffers in a loop without
> guessing the initial length. That does not look like an efficient
> implementation of the problem to me. Not sure where I have seen the
> functions which handle that, maybe in one of the XML parsers.
>
> BTW: maybe also the input should be a CharSequence not a String?
>
> Greetings
> Bernd
>
> On 10.12.2013, 05:14, Peter Wall <pwall@pwall.net> wrote:
>
>> Hi, I'm new here, so please forgive me if I'm duplicating a previous
>> discussion (I looked back through several months of archives for
>> something related, before suffering a near-fatal attack of tl;dr).
>>
>> I have a toolbox of functions that I have accumulated over the years
>> and among them are "escape" functions for converting, for example,
>> XML "&" to "&amp;" etc.  When I showed these to a colleague he asked
>> why I didn't use the Apache Commons utilities, so I benchmarked my
>> functions against the Commons versions and found that mine were
>> approximately 10 times faster.  At which point the same colleague
>> suggested submitting my versions to Apache, so here goes.
>>
>> The code in org.apache.commons.lang3.text.translate is very elegant
>> in the way it uses the same code and the same initialisation
>> character arrays for both the escape and the unescape functions, but
>> this elegance comes at a cost.  The unescape will need to look up
>> multi-character sequences, but the escape code will ALWAYS be looking
>> up single characters, and this can be made much simpler than a string
>> match.  And in my view the function should never allocate a new
>> object until it finds that it needs to do so - in many cases the
>> string will not need to be modified at all so the original string
>> should be returned.
>>
>> The escape function is:
>>
>>      public static final String escape(String s, CharMapper mapper) {
>>          for (int i = 0, n = s.length(); i < n; ) {
>>              char ch = s.charAt(i++);
>>              String mapped = mapper.map(ch);
>>              if (mapped != null) {
>>                  StringBuilder sb = new StringBuilder();
>>                  for (int j = 0, k = i - 1; j < k; ++j)
>>                      sb.append(s.charAt(j));
>>                  sb.append(mapped);
>>                  while (i < n) {
>>                      ch = s.charAt(i++);
>>                      mapped = mapper.map(ch);
>>                      if (mapped != null)
>>                          sb.append(mapped);
>>                      else
>>                          sb.append(ch);
>>                  }
>>                  return sb.toString();
>>              }
>>          }
>>          return s;
>>      }
>>
>> Where CharMapper is:
>>
>>      public interface CharMapper {
>>          String map(int codePoint);
>>      }
>>
>> and the implementation for XML is:
>>
>>      private static final CharMapper allCharMapper = new CharMapper() {
>>          @Override
>>          public String map(int codePoint) {
>>              if (codePoint == '<')
>>                  return "&lt;";
>>              if (codePoint == '>')
>>                  return "&gt;";
>>              if (codePoint == '&')
>>                  return "&amp;";
>>              if (codePoint == '"')
>>                  return "&quot;";
>>              if (codePoint == '\'')
>>                  return "&apos;";
>>              if (codePoint < ' ' && !isWhiteSpace(codePoint) || codePoint >= 0x7F) {
>>                  // isWhiteSpace checks for XML whitespace characters, \n \r etc.
>>                  StringBuilder sb = new StringBuilder(10);
>>                  sb.append("&#");
>>                  sb.append(codePoint);
>>                  sb.append(';');
>>                  return sb.toString();
>>              }
>>              return null;
>>          }
>>      };
>>
>> The whole thing can be wrapped in a simple function like:
>>
>>      public static String escapeAll(String s) {
>>          return escape(s, allCharMapper);
>>      }
>>
>> I have versions for Java string escapes, XML, HTML (including the
>> full range of entity names) and URI percent encoding, and I have
>> versions that handle UTF-16 surrogate codes.  They all perform
>> approximately an order of magnitude better than the existing Apache
>> Commons functions.  They are currently under LGPL and I have JUnit
>> tests for all of them.
>>
>> One thing to note is that my versions convert all characters over
>> 0x7F to numeric character references, thus sidestepping any concerns
>> over UTF-8 or ISO-8859-1 character set encoding.
>>
>> Is anyone interested?
>>
>> Regards,
>> Peter Wall
>>
>>
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>

