commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (LANG-1206) CharSetUtils.squeeze() Performance Issue
Date Fri, 20 May 2016 18:05:12 GMT


ASF GitHub Bot commented on LANG-1206:

GitHub user PascalSchumacher opened a pull request:

    LANG-1206: improve CharSetUtils.squeeze() performace

    patch by Mohammed Alfallaj

You can merge this pull request into a Git repository by running:

    $ git pull CharSetUtils#squeeze

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #147
commit 8c87ace5492a7093e93723571d95aeab47539166
Author: pascalschumacher <>
Date:   2016-05-20T18:04:30Z

    LANG-1206: improve CharSetUtils.squeeze() performace
    patch by Mohammed Alfallaj


> CharSetUtils.squeeze() Performance Issue
> ----------------------------------------
>                 Key: LANG-1206
>                 URL:
>             Project: Commons Lang
>          Issue Type: Improvement
>          Components: lang.*
>    Affects Versions: 3.4
>            Reporter: Mohammed Alfallaj
>              Labels: patch, performance
>             Fix For: 3.4
>         Attachments: CharSetUtils_squeeze_performance_issue.patch, Description_Squeeze.docx
> Before reading the description below, please note that there was an issue created in
10/Jul/2011 titled "CharSetUtils.squeeze() speedup" LANG-715. This solution does not prevent
the repetitive calls to the expensive contains() method explained below. In addition, I updated
this to (major) because of the high CPU cost of the current solution.
> A readable form of the description is attached in Description_Squeeze.docx.
> Below is a text format of the description (in case the .docx does not work):
> ***********************************************************************
> Performance problems of current CharSetUtils.squeeze() algorithm:
> ***********************************************************************
> ************************************
> ************************************
> -Method org.apache.commons.lang3.CharSet.contains() is called in each character iteration
for a consecutive set of duplicated characters in str (starting from the second character
in the duplicated set). This large number of calls to contains() are determined to increase
the performance overhead for squeeze() calls given the high computing cost of executing contains()
method. Note that only one call to contains() is needed for the second character in this set
of duplicated characters. There is no need to repeat the calls to contains() method for the
same duplicated character that we have already checked in the second occurrence. For example,
if str=”abbbbbc” and set=”b”, then the algorithm should call contains() only once
for the second b character occurrence and prevent calling contains() again given that it has
already been called before for this block. The current algorithm adds high performance overhead
by calling contains() repetitively for each duplicated b character starting from the second
occurrence. For this example where str=”abbbbbc” and set=”b”, the suggested solution
will call contains() method only once instead of 4 times which will largely reduce the performance
overhead of squeeze() method. 
> -Consider the following example, System.currentTimeMillis() testing shows a performance
improvement by around 83% if we run the suggested enhanced solution with the following sample
input data 
> 	str = “abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc”
> 	set = “b”
> ************************************
> ************************************
> -Using two character flags inChars and notInChars:
> 	inChars: flag to record the recent character that has been determined to be in chars
> 	notInChars: flag to record the recent character that has been determined to NOT be in
chars CharSet
> -Adding the following conditions to prevent unnecessary/expensive calls to contains()
method. Only one call to contains() method is needed starting from the second character in
the consecutive set of duplicated characters:
> 	if ((inChars != null) && (ch == inChars))
> 		this if-statement checks if this ch char has recently been determined to be in chars
CharSet, this prevents calling contains() more than once for a consecutive set of duplicated
characters in str
> 		if one of these conditions fails, it means that the consecutive set of duplicated characters
for this ch character has not yet been checked for this block of duplicated characters (i.e.
has not been checked = we have not yet done the one-time call to contains method for this
particular set of duplicated characters)
> 	if ((notInChars == null) || (ch != notInChars))
> 		if both conditions are false, it means that ch has recently been determined to NOT
be in chars CharSet
> 	if (chars.contains(ch))
> 		here we call the expensive chars.contains() only once for this consecutive set of duplicated
> -Note: this suggested solution does not break the functionality of org.apache.commons.lang3.CharSetUtils.squeeze(String
str, String… set). This suggested solution is executed against the latest UnitTest
written by Apache.
> ************************************
> ************************************
> the current algorithm starts the for-loop with i=0 which is not a necessary step. This
adds the overhead of executing the if condition (i != 0). In this logic, the algorithm will
always append the first character from str to the buffer. So there is no need to execute (i
!= 0) for every character in the consecutive set of duplicated characters.
> ************************************
> ************************************
> lastChar is assigned the first character in str and appended to buffer before executing
the for loop. Then the for loop begins with i=1 removing the unnecessary (i != 0) executions.

This message was sent by Atlassian JIRA

View raw message