commons-issues mailing list archives

From "Thomas Neidhart (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LANG-935) Possible performance improvement on string escape functions
Date Sat, 14 Mar 2015 09:33:38 GMT

    [ https://issues.apache.org/jira/browse/LANG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361684#comment-14361684 ]

Thomas Neidhart edited comment on LANG-935 at 3/14/15 9:33 AM:
---------------------------------------------------------------

OK, my bad, I misread the line.

The problem that I have with this patch is that it tries to optimize one very specific use case
(the StringEscapeUtils.escapeXXX methods) but may lead to worse performance in other use cases.

The benchmark is also very limited as it tests only one example of the use of a LookupTranslator.

One peculiarity of the escapeXXX methods is that they escape only single characters, so
we could easily handle this case in the LookupTranslator by caching 1-char translations in
a separate map keyed by character and looking those up directly. The speedup would be the same
as for your solution (I benchmarked it).
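
To make the single-character idea concrete, here is a minimal sketch. It is not the Commons Lang
code and not the patch under review: the class name CachingLookupSketch, its constructor and its
translate method are made up for illustration. It assumes the usual greedy, longest-key-first
matching for multi-char keys and falls back to a per-character map for 1-char keys.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a LookupTranslator-like class (illustration only).
final class CachingLookupSketch {
    // 1-char keys live in their own map, keyed by the character itself.
    private final Map<Character, String> singleCharMap = new HashMap<>();
    // Keys of length >= 2 stay in a general string-keyed map.
    private final Map<String, String> multiCharMap = new HashMap<>();
    private int shortest = Integer.MAX_VALUE; // shortest multi-char key length
    private int longest = 0;                  // longest multi-char key length

    CachingLookupSketch(String[][] lookup) {
        for (String[] pair : lookup) {
            String key = pair[0];
            if (key.isEmpty()) {
                continue; // an empty key can never match anything useful
            }
            if (key.length() == 1) {
                singleCharMap.put(key.charAt(0), pair[1]);
            } else {
                multiCharMap.put(key, pair[1]);
                shortest = Math.min(shortest, key.length());
                longest = Math.max(longest, key.length());
            }
        }
    }

    String translate(CharSequence input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        int i = 0;
        while (i < input.length()) {
            int consumed = 0;
            // General path: try multi-char keys, longest first.
            // The loop body never runs when there are no multi-char keys.
            int max = Math.min(longest, input.length() - i);
            for (int len = max; len >= shortest; len--) {
                String hit = multiCharMap.get(input.subSequence(i, i + len).toString());
                if (hit != null) {
                    out.append(hit);
                    consumed = len;
                    break;
                }
            }
            if (consumed == 0) {
                // Fast path: a single map lookup by character, no substring created.
                char c = input.charAt(i);
                String hit = singleCharMap.get(c);
                if (hit != null) {
                    out.append(hit);
                } else {
                    out.append(c);
                }
                consumed = 1;
            }
            i += consumed;
        }
        return out.toString();
    }
}
```

With only 1-char keys registered, as in the escapeXXX tables, every position costs one map
lookup and no temporary substring; multi-char keys keep the general behaviour.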

However, the LookupTranslator is a public class, so users might use it for their own translations.
Imagine someone created a LookupTranslator that translates some strings to other strings. With
the patch, performance might drop if many equally sized strings that share the same first
character have been put into the translator. In that case, every one of those translations has
to be tested each time such a character is encountered.

Edit: on second thought, and looking at the original proposal, I think we should add a LookupTranslator
that works solely with chars (like the CharMapper in the proposal) and use this one in the
escapeXXX methods.
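
A rough sketch of what such a char-only translator could look like follows. This is an
assumption on my side, not the CharMapper from the attachment: the name CharOnlyTranslatorSketch,
the Map-based constructor and the array table are all hypothetical. It also returns the input
unchanged when nothing needs escaping, which addresses point 2 of the issue description.

```java
import java.util.Map;

// Hypothetical char-only translator (illustration only, not the proposal's CharMapper).
final class CharOnlyTranslatorSketch {
    // Replacement text indexed by char value; null means "copy the char as-is".
    private final String[] table;

    CharOnlyTranslatorSketch(Map<Character, String> escapes) {
        int max = -1;
        for (char key : escapes.keySet()) {
            max = Math.max(max, key);
        }
        // Sized to the highest mapped char; at most 65536 entries.
        table = new String[max + 1];
        for (Map.Entry<Character, String> e : escapes.entrySet()) {
            char key = e.getKey();
            table[key] = e.getValue();
        }
    }

    String translate(String input) {
        StringBuilder out = null; // allocated lazily, on the first char that needs escaping
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = c < table.length ? table[c] : null;
            if (replacement != null) {
                if (out == null) {
                    out = new StringBuilder(input.length() + 16);
                    out.append(input, 0, i); // copy the untouched prefix once
                }
                out.append(replacement);
            } else if (out != null) {
                out.append(c);
            }
        }
        // No replacement happened: hand back the original string, no new object.
        return out == null ? input : out.toString();
    }
}
```

The array lookup avoids hashing and boxing per character; whether that matters in practice
would have to be confirmed by a broader benchmark than the one attached.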


> Possible performance improvement on string escape functions
> -----------------------------------------------------------
>
>                 Key: LANG-935
>                 URL: https://issues.apache.org/jira/browse/LANG-935
>             Project: Commons Lang
>          Issue Type: Improvement
>          Components: lang.text.translate.*
>    Affects Versions: 3.1
>            Reporter: Peter Wall
>            Priority: Minor
>              Labels: performance
>             Fix For: Patch Needed
>
>         Attachments: tempproject1.zip
>
>
> The escape functions for HTML etc. use the same code and the same initialisation tables
> for the escape and unescape functions, and while this is an elegant approach it leads to a
> number of deficiencies:
> 1. The code is very much less efficient than it could be
> 2. A new output string is created even when no conversion is required
> 3. No mapping is provided for characters that do not have a specific representation (for
> example HTML 0x101 should become &#257; )
> The proposal is to use a new mapping technique to address these issues



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
