commons-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LANG-877) Performance improvements for StringEscapeUtils
Date Fri, 13 Mar 2015 09:40:38 GMT

    [ https://issues.apache.org/jira/browse/LANG-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360145#comment-14360145 ]

ASF GitHub Bot commented on LANG-877:
-------------------------------------

GitHub user CodingFabian opened a pull request:

    https://github.com/apache/commons-lang/pull/49

    LANG-877 removes unnecessary string allocation and improves hex writing.

    Removes temporary allocation of char arrays and Strings from unicode escaping.
    This roughly doubles the throughput of the escaping functionality mentioned in
    LANG-877. As a side effect this significantly reduces garbage (for the cases
    where the JVM does not allocate the char arrays / String on stack).
    
    Note that there is a minor duplication / inconsistency which I uncovered.
    CharUtils will use lowercase letters, UnicodeEscaper uppercase.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/CodingFabian/commons-lang LANG-877

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/commons-lang/pull/49.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #49
    
----
commit eaaea2aab23e752841340d44aa41e846dea0a7b9
Author: Fabian Lange <lange.fabian@gmail.com>
Date:   2015-03-13T09:34:26Z

    LANG-877 removes unnecessary string allocation and improves hex writing.
    
    Removes temporary allocation of char arrays and Strings from unicode escaping.
    This roughly doubles the throughput of the escaping functionality mentioned in
    LANG-877. As a side effect this significantly reduces garbage (for the cases
    where the JVM does not allocate the char arrays / String on stack).
    
    Note that there is a minor duplication / inconsistency which I uncovered.
    CharUtils will use lowercase letters, UnicodeEscaper uppercase.

----


> Performance improvements for StringEscapeUtils
> ----------------------------------------------
>
>                 Key: LANG-877
>                 URL: https://issues.apache.org/jira/browse/LANG-877
>             Project: Commons Lang
>          Issue Type: Improvement
>          Components: lang.text.translate.*
>    Affects Versions: 3.1
>            Reporter: Henri Yandell
>             Fix For: Patch Needed
>
>
> An email on the list from Lawrence Angrave:
> Hi,
> Some comments that are relevant to  Apache3 UnicodeEscaper and Apache2's StringEscapeUtils.java
> Summary-
>  * I noticed the current Apache code creates three String objects each
>    time it writes a unicode hexadecimal  value.
>  * Apache3 can also create a char[] array per character translation
>    (but I do not include a fix for that)
>  * This is an easy-to-fix performance bottleneck when writing many
>    non-ascii characters.
>  * The logic to test for unicode values of different magnitudes can
>    also be simplified.
>  * Benchmark and code fixes for Apache2 and Apache3 are included. I do
>    not have time to become an Apache maintainer. Use or ignore at your
>    choice.
>  * I'm not interested in being a developer for Commons Lang. Use it or
>    not - that's a choice for Commons Lang developers.
> A simple fix more than doubles the string escape speed (40 ms vs. 100 ms to translate all unicode characters) for Apache3.
> The older Apache2-style implementation can now translate all unicode characters in 8 ms.
> The existing Apache3/Apache2 code writes unicode hex values like this:
> {code}
>             if (codepoint > 0xfff) {
>                 out.write("\\u" + hex(codepoint));
>             } else if (codepoint > 0xff) {
>                 out.write("\\u0" + hex(codepoint));
>             } else if (codepoint > 0xf) {
>                 out.write("\\u00" + hex(codepoint));
>             } else {
>                 out.write("\\u000" + hex(codepoint));
>             }
> {code}
> The hex() function,
> {code}
> //hex(): return Integer.toHexString(codepoint).toUpperCase(Locale.ENGLISH);
> {code}
> also creates two string objects, so we have 3 objects per unicode hex value.
> FIX:
> The padding logic can be simplified and per-character object creation can be eliminated by writing hex digits directly:
> {code}
>             out.write("\\u");
>             out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>             out.write(HEX_DIGIT[(codepoint) & 15]);
> {code}
> where  HEX_DIGIT  is
> {code}
> public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
> {code}
> I believe this is safe for all Locales.
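
A minimal, self-contained sketch of the direct hex-digit approach quoted above, for readers who want to try it in isolation. The class name `HexEscapeSketch` and the helpers `writeUnicodeEscape` and `escape` are hypothetical names chosen for illustration; they are not part of Commons Lang, and this is not the code from the pull request.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class HexEscapeSketch {
    // Same lookup table as in the quoted email.
    private static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();

    // Writes \\uXXXX for a value in the range 0..0xFFFF without creating
    // any intermediate String or char[] per call.
    static void writeUnicodeEscape(Writer out, int codepoint) throws IOException {
        out.write('\\');
        out.write('u');
        out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
        out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
        out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
        out.write(HEX_DIGIT[codepoint & 15]);
    }

    // Convenience wrapper for experiments; it allocates a StringWriter,
    // so it is for demonstration only, not for the hot path.
    static String escape(int codepoint) {
        StringWriter sw = new StringWriter();
        try {
            writeUnicodeEscape(sw, codepoint);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape(0x7));
        System.out.println(escape(0xE9));
        System.out.println(escape(0x20AC));
    }
}
```

The shifts always emit exactly four digits, so the zero padding that the original if/else chain handled with string prefixes falls out of the arithmetic. Because the digits come from a fixed table rather than `toUpperCase`, no Locale is involved.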
> When benchmarked it was disconcerting that Apache3 is still five times slower (40ms instead of 8ms) than my rewritten Apache2 version (included below).
> My guess is that there are other unnecessary per-character object creation issues still lurking. Here's one example -
> {code}
> // In CharSequenceTranslator.translate(CharSequence input, Writer out):
> char[] c = Character.toChars(Character.codePointAt(input, pos));
> {code}
> For better performance this should use {{toChars(int codePoint, char[] dst, int dstIndex)}}, which can re-use the dst char array.
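
The buffer-reuse suggestion above can be sketched as follows. This is an illustration only, assuming a plain pass-through loop; `ReusableBufferSketch`, `copyCodePoints`, and `roundTrip` are hypothetical names, and the loop is not the actual body of `CharSequenceTranslator.translate`.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class ReusableBufferSketch {
    // Copies the input to the writer one code point at a time, reusing a
    // single two-char buffer instead of allocating a char[] per code point.
    static void copyCodePoints(CharSequence input, Writer out) throws IOException {
        final char[] buf = new char[2]; // large enough for a surrogate pair
        final int len = input.length();
        int pos = 0;
        while (pos < len) {
            int codePoint = Character.codePointAt(input, pos);
            // The three-argument toChars fills buf and returns 1 or 2.
            int written = Character.toChars(codePoint, buf, 0);
            out.write(buf, 0, written);
            pos += written;
        }
    }

    // Demonstration helper that collects the output into a String.
    static String roundTrip(String s) {
        StringWriter sw = new StringWriter();
        try {
            copyCodePoints(s, sw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        String sample = "a\uD83D\uDE00b"; // contains one supplementary code point
        System.out.println(roundTrip(sample).equals(sample));
    }
}
```

The buffer is allocated once per call rather than once per character, which is the saving the email is pointing at.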
> The benchmark, my version of an Apache2-style escapeJavaStyleString implementation and the code fix for UnicodeEscaper.java are included below.
> I hope this email does not go into a blackhole... Feel free to forward it to the relevant maintainers.
> Regards,
> Lawrence.
> {code}
>     public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
>     // non-zero entries for special case control characters
>     public static final char[] CONTROL_CHARS;
>     static {
>         CONTROL_CHARS = new char[32];
>         CONTROL_CHARS['\b'] = 'b';
>         CONTROL_CHARS['\n'] = 'n';
>         CONTROL_CHARS['\t'] = 't';
>         CONTROL_CHARS['\f'] = 'f';
>         CONTROL_CHARS['\r'] = 'r';
>     }
>     public static void escapeJavaStyleString(Writer out, String s, boolean escapeSingleQuote)
>             throws IOException {
>         // Apache2 makes the following checks, so we will too-
>         if (out == null) throw new IllegalArgumentException("The Writer must not be null");
>         if (s == null) return;
>         final int len = s.length();
>         for (int i = 0; i < len; i++)
>             escapeChar(out, s.charAt(i), escapeSingleQuote);
>     }
>     public static void escapeChar(Writer out, char c, boolean escapeSingleQuote)
>             throws IOException {
>         // Most common case
>         if (c >= 32 && c < 127) {
>             if (c == '\\' || c == '"' || (c == '\'' && escapeSingleQuote))
>                 out.write('\\');
>             out.write(c);
>             return;
>         }
>         out.write('\\');
>         if (c < 32 && CONTROL_CHARS[c] != 0) {
>             out.write(CONTROL_CHARS[c]);
>             return;
>         }
>         // Fast 4 digit uppercase hexadecimal without object creation
>         out.write('u');
>         out.write(HEX_DIGIT[(c >> 12) & 15]);
>         out.write(HEX_DIGIT[(c >> 8) & 15]);
>         out.write(HEX_DIGIT[(c >> 4) & 15]);
>         out.write(HEX_DIGIT[(c) & 15]);
>     }
> {code}
> FYI The benchmark test just writes all possible unicode characters into a null writer:
> {code}
>         Writer nullWriter = new Writer() {
>             public void write(String s) {
>             }
>             public void write(int c) {
>             }
>             public void close() throws IOException {
>             }
>             public void flush() throws IOException {
>             }
>             public void write(char[] cbuf, int off, int len) throws IOException {
>             }
>         };
>         StringBuilder sb = new StringBuilder(0x10000);
>         for (int i = 0; i <= 0xffff; i++)
>             sb.append((char) i);
>         String allChars = sb.toString();
>         long t1 = System.currentTimeMillis();
>         StringEscaper.escapeJavaStyleString(nullWriter, allChars, true);
>         long t2 = System.currentTimeMillis();
>         System.out.println(t2 - t1);
>         long t3 = System.currentTimeMillis();
>         CharSequenceTranslator translator = StringEscapeUtils.ESCAPE_JAVA;
>         translator.translate(allChars, nullWriter);
>         long t4 = System.currentTimeMillis();
>         System.out.println(t4 - t3);
> {code}
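
The benchmark above times a single cold run with `System.currentTimeMillis()`. As a hedged aside, a slightly more robust harness might warm up first and take the best of several runs using `System.nanoTime()`. The sketch below is an assumption about how one could do that, not the measurement method used in the email or the pull request; `TimingSketch`, `bestOfNanos`, and the placeholder workload are hypothetical.

```java
class TimingSketch {
    // Written by the workload so the JIT cannot discard it as dead code.
    static volatile int sink;

    // Runs the task `warmups` times untimed, then returns the fastest of
    // `runs` timed executions in nanoseconds.
    static long bestOfNanos(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = System.nanoTime() - start;
            if (elapsed < best) {
                best = elapsed;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Same input as the quoted benchmark: every char value once.
        StringBuilder sb = new StringBuilder(0x10000);
        for (int i = 0; i <= 0xffff; i++) {
            sb.append((char) i);
        }
        final String allChars = sb.toString();

        // Placeholder workload; substitute the escaping call under test.
        Runnable work = () -> {
            int sum = 0;
            for (int i = 0; i < allChars.length(); i++) {
                sum += allChars.charAt(i);
            }
            sink = sum;
        };
        System.out.println(bestOfNanos(work, 20, 10) + " ns");
    }
}
```

For publishable numbers a dedicated harness such as JMH is the usual choice; this sketch only reduces the most obvious warm-up noise.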
> The modification to Apache3 UnicodeEscaper :
> {code}
>         if (codepoint > 0xffff) {
>             // TODO: Figure out what to do. Output as two Unicodes?
>             // Does this make this a Java-specific output class?
>             out.write("\\u" + hex(codepoint));
>         } else if (1 == 0) { // OLD SLOW CODE (can be removed)
>             if (codepoint > 0xfff) {
>                 out.write("\\u" + hex(codepoint));
>             } else if (codepoint > 0xff) {
>                 out.write("\\u0" + hex(codepoint));
>             } else if (codepoint > 0xf) {
>                 out.write("\\u00" + hex(codepoint));
>             } else {
>                 out.write("\\u000" + hex(codepoint));
>             }
>         } else { // NEW FAST CODE
>             out.write("\\u");
>             out.write(HEX_DIGIT[(codepoint >> 12) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 8) & 15]);
>             out.write(HEX_DIGIT[(codepoint >> 4) & 15]);
>             out.write(HEX_DIGIT[(codepoint) & 15]);
>         }
>         // and add:
>         // public static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();
> {code}
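
The TODO in the quoted branch for `codepoint > 0xffff` asks whether to output "two Unicodes". One possible answer, assuming Java-style output is wanted, is to emit the UTF-16 surrogate pair as two `\u` escapes. The sketch below illustrates that option only; it is not what the quoted code or the pull request does, and `SurrogateEscapeSketch`, `writeHex4`, `writeEscaped`, and `escape` are hypothetical names.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class SurrogateEscapeSketch {
    private static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();

    // Writes one \\uXXXX escape for a 16-bit value.
    static void writeHex4(Writer out, int value) throws IOException {
        out.write('\\');
        out.write('u');
        out.write(HEX_DIGIT[(value >> 12) & 15]);
        out.write(HEX_DIGIT[(value >> 8) & 15]);
        out.write(HEX_DIGIT[(value >> 4) & 15]);
        out.write(HEX_DIGIT[value & 15]);
    }

    // Assumption: supplementary code points are written as the two
    // escapes of their UTF-16 surrogate pair, as in Java source literals.
    static void writeEscaped(Writer out, int codepoint) throws IOException {
        if (codepoint > 0xffff) {
            writeHex4(out, Character.highSurrogate(codepoint));
            writeHex4(out, Character.lowSurrogate(codepoint));
        } else {
            writeHex4(out, codepoint);
        }
    }

    // Demonstration helper that collects the output into a String.
    static String escape(int codepoint) {
        StringWriter sw = new StringWriter();
        try {
            writeEscaped(sw, codepoint);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape(0x1F600)); // a supplementary code point
        System.out.println(escape(0x20AC));  // a BMP code point
    }
}
```

Whether this is the right behaviour depends on the target format, which is exactly the question the TODO raises: a surrogate-pair escape suits Java and JavaScript string literals but not every consumer of `\u` sequences.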



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
