commons-dev mailing list archives

From "Bernd Eckenfels" <e...@zusammenkunft.net>
Subject Re: [lang] Suggested alternatives for escape functions
Date Tue, 10 Dec 2013 16:39:06 GMT
Hello,

It depends on what you want to escape: a single Unicode character can
take two UTF-16 code units (a single Java char can only cover the BMP),
so a String-typed needle can be helpful. But of course all the usual
suspects are single-char characters (<>&"...). Having said that, is
there any reason why CharMapper takes an int rather than a char? That
is misleading in this context if someone expects it to be a real code
point, which it is not (the caller uses charAt()).
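
To illustrate the point, here is a minimal sketch of the difference
between the two lookups. The class and method names are made up for this
example; only charAt() and codePointAt() come from the JDK.

```java
// Hypothetical demo class, not part of Commons Lang or the proposed code.
final class CodePointDemo {

    // What a charAt()-based loop hands to a mapper: one UTF-16 code unit,
    // silently widened to int.
    static int viaCharAt(String s, int index) {
        return s.charAt(index);
    }

    // What a real code point lookup returns: a surrogate pair is combined.
    static int viaCodePointAt(String s, int index) {
        return s.codePointAt(index);
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so Java stores it as two chars.
        String s = "a\uD83D\uDE00";

        System.out.println(s.length());                          // 3 chars
        System.out.println(s.codePointCount(0, s.length()));     // 2 code points
        System.out.println(Integer.toHexString(viaCharAt(s, 1)));       // d83d
        System.out.println(Integer.toHexString(viaCodePointAt(s, 1)));  // 1f600
    }
}
```

A mapper that is fed from charAt() would therefore see 0xD83D and 0xDE00
separately and could emit two meaningless numeric references for them.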

Besides that, the implementation copies single characters to the new
StringBuilder and produces multiple StringBuilders in a loop without
guessing the initial length. That does not look like an efficient
implementation of the problem to me. I am not sure where I have seen
functions which handle that, maybe in one of the XML parsers.

BTW: maybe also the input should be a CharSequence not a String?
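
Roughly what I have in mind, as an untested-in-production sketch rather
than a finished proposal: it accepts a CharSequence, walks the input by
code point, copies untouched runs in one append() call, and sizes the
builder from the input length. EscapeSketch, XML_MAPPER and the "+ 16"
headroom are assumptions of this example; the mapper is simplified and
does not handle control characters as the original one does.

```java
import java.util.function.IntFunction;

// Hypothetical sketch; names and sizing heuristic are illustrative only.
final class EscapeSketch {

    // Returns the replacement for a code point, or null to keep it as-is.
    static final IntFunction<String> XML_MAPPER = cp -> {
        switch (cp) {
            case '<':  return "&lt;";
            case '>':  return "&gt;";
            case '&':  return "&amp;";
            case '"':  return "&quot;";
            case '\'': return "&apos;";
            // Everything from 0x7F upwards becomes a numeric reference.
            default:   return cp >= 0x7F ? "&#" + cp + ";" : null;
        }
    };

    static CharSequence escape(CharSequence s, IntFunction<String> mapper) {
        StringBuilder sb = null;  // allocated only when a replacement is needed
        int copied = 0;           // start of the run not yet copied into sb
        int n = s.length();
        for (int i = 0; i < n; ) {
            int cp = Character.codePointAt(s, i);     // combines surrogate pairs
            int next = i + Character.charCount(cp);   // 1 or 2 chars consumed
            String mapped = mapper.apply(cp);
            if (mapped != null) {
                if (sb == null) {
                    sb = new StringBuilder(n + 16);   // guess: input plus headroom
                }
                // Copy the whole unmodified run at once, then the replacement.
                sb.append(s, copied, i).append(mapped);
                copied = next;
            }
            i = next;
        }
        if (sb == null) {
            return s;             // nothing to escape: return the input itself
        }
        return sb.append(s, copied, n).toString();
    }
}
```

With this, escape("a<b", XML_MAPPER) yields "a&lt;b", and an input that
needs no escaping comes back as the same object without any allocation.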

Greetings
Bernd

On 10.12.2013 at 05:14, Peter Wall <pwall@pwall.net> wrote:

> Hi, I'm new here, so please forgive me if I'm duplicating a previous  
> discussion (I looked back through several months of archives for  
> something related, before suffering a near-fatal attack of tl;dr).
>
> I have a toolbox of functions that I have accumulated over the years and  
> among them are "escape" functions for converting, for example, XML "&"  
> to "&amp;" etc.  When I showed these to a colleague he asked why I  
> didn't use the Apache Commons utilities, so I benchmarked my functions  
> against the Commons versions and found that mine were approximately 10  
> times faster.  At which point the same colleague suggested submitting my  
> versions to Apache, so here goes.
>
> The code in org.apache.commons.lang3.text.translate is very elegant in  
> the way it uses the same code and the same initialisation character  
> arrays for both the escape and the unescape functions, but this elegance  
> comes at a cost.  The unescape will need to look up multi-character  
> sequences, but the escape code will ALWAYS be looking up single  
> characters, and this can be made much simpler than a string match.  And  
> in my view the function should never allocate a new object until it  
> finds that it needs to do so - in many cases the string will not need to  
> be modified at all so the original string should be returned.
>
> The escape function is:
>
>      public static final String escape(String s, CharMapper mapper) {
>          for (int i = 0, n = s.length(); i < n; ) {
>              char ch = s.charAt(i++);
>              String mapped = mapper.map(ch);
>              if (mapped != null) {
>                  StringBuilder sb = new StringBuilder();
>                  for (int j = 0, k = i - 1; j < k; ++j)
>                      sb.append(s.charAt(j));
>                  sb.append(mapped);
>                  while (i < n) {
>                      ch = s.charAt(i++);
>                      mapped = mapper.map(ch);
>                      if (mapped != null)
>                          sb.append(mapped);
>                      else
>                          sb.append(ch);
>                  }
>                  return sb.toString();
>              }
>          }
>          return s;
>      }
>
> Where CharMapper is:
>
>      public interface CharMapper {
>          String map(int codePoint);
>      }
>
> and the implementation for XML is:
>
>      private static final CharMapper allCharMapper = new CharMapper() {
>          @Override
>          public String map(int codePoint) {
>              if (codePoint == '<')
>                  return "&lt;";
>              if (codePoint == '>')
>                  return "&gt;";
>              if (codePoint == '&')
>                  return "&amp;";
>              if (codePoint == '"')
>                  return "&quot;";
>              if (codePoint == '\'')
>                  return "&apos;";
>              if (codePoint < ' ' && !isWhiteSpace(codePoint) || codePoint >= 0x7F) {
>                  // isWhiteSpace checks for XML whitespace characters, \n \r etc.
>                  StringBuilder sb = new StringBuilder(10);
>                  sb.append("&#");
>                  sb.append(codePoint);
>                  sb.append(';');
>                  return sb.toString();
>              }
>              return null;
>          }
>      };
>
> The whole thing can be wrapped in a simple function like:
>
>      public static String escapeAll(String s) {
>          return escape(s, allCharMapper);
>      }
>
> I have versions for Java string escapes, XML, HTML (including the full  
> range of entity names) and URI percent encoding, and I have versions  
> that handle UTF-16 surrogate codes.  They all perform approximately an
> order of magnitude better than the existing Apache Commons functions.
> They are currently under LGPL and I have JUnit tests for all of them.
>
> One thing to note is that my versions convert all characters over 0x7F  
> to numeric character references, thus sidestepping any concerns over  
> UTF-8 or ISO-8859-1 character set encoding.
>
> Is anyone interested?
>
> Regards,
> Peter Wall
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>


-- 
http://www.zusammenkunft.net


