commons-issues mailing list archives

From "Sebb (Jira)" <j...@apache.org>
Subject [jira] [Commented] (LANG-1606) StringUtils.countMatches returns incorrect value while handling intersecting substrings
Date Tue, 01 Sep 2020 12:11:00 GMT

    [ https://issues.apache.org/jira/browse/LANG-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188393#comment-17188393 ]

Sebb commented on LANG-1606:
----------------------------

There are two ways to count matches: overlapping or non-overlapping.
The code currently counts non-overlapping matches, and it does so correctly.

I agree that the Javadoc is not clear on this; however, I don't think it is actually wrong.

Given that users may rely on the current behaviour, it is generally the Javadoc that must
be changed rather than the code.
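
To make the two counting modes concrete, here is a minimal sketch. It is not the Commons Lang implementation: the class and method names are made up for illustration, and it uses plain String.indexOf instead of CharSequenceUtils. The only difference between the two methods is how far the search index advances after a hit.

```java
public final class MatchCounting {

    private MatchCounting() {
    }

    // Hypothetical helper: counts matches that do not share characters,
    // by skipping the whole match after each hit.
    public static int countNonOverlapping(String str, String sub) {
        if (str.isEmpty() || sub.isEmpty()) {
            return 0;
        }
        int count = 0;
        int idx = 0;
        while ((idx = str.indexOf(sub, idx)) != -1) {
            count++;
            idx += sub.length(); // resume after the end of this match
        }
        return count;
    }

    // Hypothetical helper: counts every start position, so matches may
    // share characters with each other.
    public static int countOverlapping(String str, String sub) {
        if (str.isEmpty() || sub.isEmpty()) {
            return 0;
        }
        int count = 0;
        int idx = 0;
        while ((idx = str.indexOf(sub, idx)) != -1) {
            count++;
            idx += 1; // resume one character after the start of this match
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNonOverlapping("abaabaababaab", "aba")); // 3
        System.out.println(countOverlapping("abaabaababaab", "aba"));    // 4
    }
}
```

With the input from the report, the first method gives 3 (what countMatches returns today) and the second gives 4 (what the reporter expected).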

> StringUtils.countMatches returns incorrect value while handling intersecting substrings
> ---------------------------------------------------------------------------------------
>
>                 Key: LANG-1606
>                 URL: https://issues.apache.org/jira/browse/LANG-1606
>             Project: Commons Lang
>          Issue Type: Bug
>          Components: lang.*
>    Affects Versions: 3.11
>            Reporter: Rustem Galiev
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Steps to reproduce:
> 1. Call the method like this:
> {code:java}
> int count = StringUtils.countMatches("abaabaababaab", "aba");
> {code}
> Actual result: the value of count variable equals 3
>  Expected result: the value of count variable equals 4
> The substrings are highlighted in red:
>  {color:#ff0000}aba{color}abaababaab
>  aba{color:#ff0000}aba{color}ababaab
>  abaaba{color:#ff0000}aba{color}baab
>  abaabaab{color:#ff0000}aba{color}ab
> Method returns incorrect value because of this code:
> {code:java}
> while ((idx = CharSequenceUtils.indexOf(str, sub, idx)) != INDEX_NOT_FOUND) {
>     count++;
>     idx += sub.length();
> }
> {code}
> This looks like a greedy algorithm, but increasing the idx variable by the length of the substring can cause problems, as in this example:
> Let's say that idx = 6, so we try to find a substring in the highlighted suffix:
>  abaaba{color:#ff0000}ababaab{color}
> We found the substring, so idx now becomes idx + 3 = 9. The search therefore continues in this suffix:
>  abaabaaba{color:#ff0000}baab{color}
>  But because idx was increased by 3, we won't find the substring (abaabaab{color:#ff0000}aba{color}ab) that intersects with the substring found in the previous step.
> Basically, this method will work incorrectly with any substrings that intersect with each other.
> There is also a unit test with an incorrect expected value:
> {code:java}
> assertEquals(4,
>      StringUtils.countMatches("oooooooooooo", "ooo"));
> {code}
> If this behavior (counting substrings that do not intersect) is intended, please update the JavaDoc to reflect it. Right now it looks like this:
> {code:java}
> Counts how many times the substring appears in the larger string.
> {code}
> Link for the PR: https://github.com/apache/commons-lang/pull/615
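
The index walk-through in the report can be reproduced with a small trace. This is only a sketch under assumptions: MatchTrace and matchStarts are hypothetical names, String.indexOf stands in for CharSequenceUtils.indexOf, and the step parameter is added here so that the advance by sub.length() can be compared with an advance by 1.

```java
import java.util.ArrayList;
import java.util.List;

public final class MatchTrace {

    private MatchTrace() {
    }

    // Hypothetical helper: records the start index of every match found when
    // the search index advances by `step` characters after each hit.
    public static List<Integer> matchStarts(String str, String sub, int step) {
        List<Integer> starts = new ArrayList<>();
        if (sub.isEmpty() || step < 1) {
            return starts;
        }
        int idx = 0;
        while ((idx = str.indexOf(sub, idx)) != -1) {
            starts.add(idx);
            idx += step;
        }
        return starts;
    }

    public static void main(String[] args) {
        String str = "abaabaababaab";
        String sub = "aba";
        // Advancing by sub.length(): after the hit at 6 the search resumes at 9,
        // so the match starting at 8 is never seen.
        System.out.println(matchStarts(str, sub, sub.length())); // [0, 3, 6]
        // Advancing by 1: the match at 8 is found as well.
        System.out.println(matchStarts(str, sub, 1)); // [0, 3, 6, 8]
    }
}
```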



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
