commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Henri Yandell <flame...@gmail.com>
Subject Re: [lang] Fix for Apache2,Apache3 StringEscapeUtils/UnicodeEscaper (3 objects created per non-ascii char) & simpler unicode hex logic
Date Fri, 08 Mar 2013 18:51:49 GMT
Thanks for the report Lawrence.

With regards to Sebb's comment, I'm happy to put in a JIRA, but wanted
to resolve the attribution comment first:

"> Legal:  I, Lawrence Angrave (email address, angrave@illinois.edu
> <mailto:angrave@illinois.edu>), am the author and copyright holder of the
> new code provided in this email, and share and re-license this code under
> the current Apache2 license."

Our solution on attribution of contributions is to add them to the
contributor list (inside the pom.xml in the source and published on
this page http://commons.apache.org/proper/commons-lang/team-list.html).

Can you confirm that's acceptable?

Typically we don't put the mail address in for contributors to reduce
spam hitting them.

Thanks again,

Hen

On Tue, Mar 5, 2013 at 2:28 PM, Lawrence Angrave <angrave@illinois.edu> wrote:
> Hi,
>
> Some comments that are relevant to  Apache3 UnicodeEscaper and Apache2's
> StringEscapeUtils.java
> Summary-
>
>  * I noticed the current Apache code creates three String objects each
>    time it writes a unicode hexadecimal  value.
>  * Apache3 can also create a char[] array per character translation
>    (but I do include a fix for that)
>  * This is a easy-to-fix performance bottleneck when writing many
>    non-ascii characters.
>  * The logic to test for unicode values of different magnitudes can
>    also be simplified.
>  * Benchmark and code fixes for Apache2 and Apache 3 are included. I do
>    not have time to become an Apache maintainer. use or ignore at your
>    choice.
>  * I'm not interested in being a developer for Commons Lang  Use it or
>    not  - that's a choice for Commons Lang developers.
>
> A simple fix more than doubles the string escape speed (40 ms v 100ms to
> translate all unicode characters) for Apache3.
> The older Apache2-style implementation can now translate all unicode
> characters in 8ms.
>
> The existing Apache3/Apache2 write unicode hex values like this-
>         if (codepoint > 0xfff) {
>                 out.write("\\u" + hex(codepoint));
>             } else if (codepoint > 0xff) {
>                 out.write("\\u0" + hex(codepoint));
>             } else if (codepoint > 0xf) {
>                 out.write("\\u00" + hex(codepoint));
>             } else {
>                 out.write("\\u000" + hex(codepoint));
>             }
> The hex() function,
> //hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH);
> also creates two string objects, so we have 3 objects per unicode hex value.
>
> FIX:
>
> The padding logic can be simplified and per-character object creation can be
> eliminated by writing hex digits directly
>             out.write("\\u");
>             out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>             out.write(HEX_DIGIT[(codepoint) & 15]);
>
>
> where  HEX_DIGIT  is
> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
> I believe this is safe for all Locales.
>
>
> When benchmarked it was disconcerting that Apache3 is still five times
> slower (40ms instead of 8ms) than my rewritten Apache2 version (included
> below).
> My guess is that there are other unnecessary per-character object creation
> issues still lurking Here's one example -
> CharSequenceTranslator.translate(CharSequence input, Writer out) :
>        char[] c = *Character.toChars*(Character.codePointAt(input, pos))
>
> For better performance this should use toChars(int codePoint,  char[] dst,
> int dstIndex) , which can re-use the dst char array
>
>
>
> The benchmark, my version of a  Apache2-style escapeJavaStyleString
> implementation and the code fix for UnicodeEscaper.java  are included below.
> If you do find this useful, some source-code attribution would be
> appreciated.
> Legal:  I, Lawrence Angrave (email address, angrave@illinois.edu
> <mailto:angrave@illinois.edu>), am the author and copyright holder of the
> new code provided in this email, and share and re-license this code under
> the current Apache2 license.
>
> I hope this email does not go into a blackhole... Feel free to forward it to
> the relevant maintainers.
>
> Regards,
> Lawrence.
>
>     public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
>     public static final char[] CONTROL_CHARS; // non-zero entries for
> special case control characters
>     static {
>         CONTROL_CHARS = new char[32];
>         CONTROL_CHARS['\b'] = 'b';
>         CONTROL_CHARS['\n'] = 'n';
>         CONTROL_CHARS['\t'] = 't';
>         CONTROL_CHARS['\f'] = 'f';
>         CONTROL_CHARS['\r'] = 'r';
>     }
>     public static void  escapeJavaStyleString(Writer out, String s, boolean
> escapeSingleQuote) throws IOException {
> // Apache2 makes the following checks, so we will too-
>     if(out==null) throw new IllegalArgumentException("The Writer must not be
> null");
>         if(s == null) return;
>
>         final int len = s.length();
>         for(int i =0; i < len;i++)
>             escapeChar(out,s.charAt(i), escapeSingleQuote);
>     }
>     public static void escapeChar(Writer out, char c, boolean
> escapeSingleQuote)
>             throws IOException {
>         // Most common case
>         if (c >= 32 && c < 127) {
>             if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote))
>                 out.write('\\');
>             out.write(c);
>             return;
>         }
>         out.write('\\');
>         if (c < 32 && CONTROL_CHARS[c] != 0) {
>             out.write(CONTROL_CHARS[c]);
>             return;
>         }
>         // Fast 4 digit uppercase hexadecimal without object creation
>         out.write('u');
>         out.write(HEX_DIGIT[(c >> 12) & 15]);
>         out.write(HEX_DIGIT[(c >> 8) & 15]);
>         out.write(HEX_DIGIT[(c >> 4) & 15]);
>         out.write(HEX_DIGIT[(c) & 15]);
>     }
>
>
>
>
> FYI The benchmark test just writes all possible unicode characters into a
> null writer:
>             Writer nullWriter = new Writer() {
>             public void write(String s) {
>             };
>
>             public void write(int c) {
>             }
>
>             public void close() throws IOException {
>             }
>
>             public void flush() throws IOException {
>             }
>
>             public void write(char[] cbuf, int off, int len) throws
> IOException {
>             }
>         };
>         StringBuilder sb = new StringBuilder(0x10000);
>         for (int i = 0; i <= 0xffff; i++)
>             sb.append((char) i);
>         String allChars = sb.toString();
>
>         long t1 = System.currentTimeMillis();
>         StringEscaper.escapeJavaStyleString(nullWriter, allChars, true);
>         long t2 = System.currentTimeMillis();
>         System.out.println(t2 - t1);
>
>         long t3 = System.currentTimeMillis();
>         CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA;
>         translator.translate(allChars, nullWriter);
>         long t4 = System.currentTimeMillis();
>         System.out.println(t4 - t3);
>
>
>
> The modification to Apache3 UnicodeEscaper :
>
>         if (codepoint > 0xffff) {
>             // TODO: Figure out what to do. Output as two Unicodes?
>             // Does this make this a Java-specific output class?
>             out.write("\\u" + hex(codepoint));
>         } else if (1 == 0) { //*OLD SLOW CODE* (can be removed)
> *if (codepoint > 0xfff) {
>                 out.write("\\u" + hex(codepoint));
>             } else if (codepoint > 0xff) {
>                 out.write("\\u0" + hex(codepoint));
>             } else if (codepoint > 0xf) {
>                 out.write("\\u00" + hex(codepoint));
>             } else {
>                 out.write("\\u000" + hex(codepoint));
>             }*
>         } else { // *NEW FAST CODE*
> *            out.write("\\u");
>             out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>             out.write(HEX_DIGIT[(codepoint) & 15]);*
>         }
>
> *and add    public static final char[] HEX_DIGIT =
> "0123456789ABCDEF".toCharArray();**
> *
>
> ps.
> The home page for Commons lang
> http://commons.apache.org/proper/commons-lang/)
>  has the following 404 link on the left
>     Contributing Patches
> http://commons.apache.org/proper/patches.html
> (I believe the 'proper' is unnecessary)
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message