commons-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LANG-1300) Clarify or improve behaviour of int-based methods in StringUtils
Date Tue, 07 Mar 2017 13:19:38 GMT

    [ https://issues.apache.org/jira/browse/LANG-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899421#comment-15899421 ]

ASF GitHub Bot commented on LANG-1300:
--------------------------------------

Github user chtompki commented on the issue:

    https://github.com/apache/commons-lang/pull/251
  
    @dmjones500 - no worries about being busy, we all end up there from time to time... :-)

    
    @dmjones500 has an interesting point. The problem seems to lie with the number of supplementary
code points preceding the *findable* `searchChar`, each of which is stored as a surrogate pair
(two `char` code units) and so shifts every later index by one.  
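
    To make the code-unit versus code-point distinction concrete, here is a minimal sketch. The class and method names (`CodeUnitVsCodePoint`, `codeUnitIndex`, `codePointIndex`) are hypothetical and not part of Commons Lang; only standard `String` and `StringBuilder` calls are used.

```java
final class CodeUnitVsCodePoint {

    /** Builds U+2070E (a supplementary code point, stored as two chars) followed by 'a'. */
    static String sample() {
        return new StringBuilder().appendCodePoint(0x2070E).append('a').toString();
    }

    /** Index as String.indexOf(int) reports it: counted in UTF-16 code units. */
    static int codeUnitIndex(String s, int codePoint) {
        return s.indexOf(codePoint);
    }

    /** The same position counted in code points, or -1 if the code point is absent. */
    static int codePointIndex(String s, int codePoint) {
        int unitIndex = s.indexOf(codePoint);
        return unitIndex < 0 ? -1 : s.codePointCount(0, unitIndex);
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(s.length());             // 3 (two surrogates plus 'a')
        System.out.println(codeUnitIndex(s, 'a'));  // 2
        System.out.println(codePointIndex(s, 'a')); // 1
    }
}
```

    Each supplementary code point ahead of the match widens the gap between the two indices by one.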
    
    You may need to consider using `Character.isSurrogate(char ch)` as well as `Character.isSurrogatePair(char high, char low)`
for all characters preceding our *findable* code point. Granted, that adds an *O(n)* factor
to our method's cost, pushing us to *O(n<sup>2</sup>)*. It feels like only then can we be
absolutely certain that we are not overcounting by using *code units* as opposed to *code points*. 
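
    One possible reading of the surrogate-aware scan is sketched below. This is an assumption about how such a check could look, not the code in the pull request; `CodePointSearch` and its methods are hypothetical names. It relies only on `Character.codePointAt(CharSequence, int)`, `Character.charCount(int)`, `Character.isSurrogatePair(char, char)` and `Character.toCodePoint(char, char)`.

```java
final class CodePointSearch {

    /** Returns the code-unit index of the first match at or after start, or -1. */
    static int indexOf(CharSequence seq, int codePoint, int start) {
        int i = Math.max(start, 0);
        while (i < seq.length()) {
            // codePointAt combines a valid surrogate pair into one code point.
            int current = Character.codePointAt(seq, i);
            if (current == codePoint) {
                return i;
            }
            i += Character.charCount(current); // step 2 chars over a pair, else 1
        }
        return -1;
    }

    /** Returns the position of the first match counted in code points, or -1. */
    static int codePointIndexOf(CharSequence seq, int codePoint) {
        int points = 0;
        for (int i = 0; i < seq.length(); points++) {
            char high = seq.charAt(i);
            boolean pair = i + 1 < seq.length()
                    && Character.isSurrogatePair(high, seq.charAt(i + 1));
            int current = pair ? Character.toCodePoint(high, seq.charAt(i + 1)) : high;
            if (current == codePoint) {
                return points;
            }
            i += pair ? 2 : 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        StringBuilder builder = new StringBuilder();
        builder.appendCodePoint(0x2070E);
        System.out.println(indexOf(builder, 0x2070E, 0));   // 0
        builder.append('a');
        System.out.println(indexOf(builder, 'a', 0));       // 2
        System.out.println(codePointIndexOf(builder, 'a')); // 1
    }
}
```

    This particular sketch walks the sequence once; whether the real change can avoid the extra *O(n)* factor depends on how the existing methods are structured.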
    
    If we do move in this direction, we should be quite clear in the Javadoc that there
is a notable performance reduction when operating outside the "Basic Multilingual Plane" (ref.
[Oracle's Character documentation](https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#supplementary)).
    
    @PascalSchumacher - do you have any thoughts here as well?


> Clarify or improve behaviour of int-based methods in StringUtils
> ----------------------------------------------------------------
>
>                 Key: LANG-1300
>                 URL: https://issues.apache.org/jira/browse/LANG-1300
>             Project: Commons Lang
>          Issue Type: Improvement
>          Components: lang.*
>    Affects Versions: 3.5
>            Reporter: Duncan Jones
>            Priority: Minor
>             Fix For: Discussion
>
>
> The following methods use an {{int}} to represent a search character:
> {code:java}
> boolean contains(final CharSequence seq, final int searchChar)
> int indexOf(final CharSequence seq, final int searchChar)
> int indexOf(final CharSequence seq, final int searchChar, final int startPos)
> int lastIndexOf(final CharSequence seq, final int searchChar)
> int lastIndexOf(final CharSequence seq, final int searchChar, final int startPos)
> {code}
> When I see an {{int}} representing a character, I tend to assume the method can handle supplementary characters. However, the current behaviour of these methods depends upon whether the {{CharSequence}} is a {{String}} or not.
> {code:java}
> StringBuilder builder = new StringBuilder();
> builder.appendCodePoint(0x2070E);
> System.out.println(StringUtils.lastIndexOf(builder, 0x2070E)); // -1
> System.out.println(StringUtils.lastIndexOf(builder.toString(), 0x2070E)); // 0
> {code}
> The Javadoc for these methods is ambiguous on this point, stating:
> {quote}
> This method uses {{String.lastIndexOf(int)}} if possible.
> {quote}
> I think we should consider updating the {{CharSequenceUtils}} methods used by this class to convert all {{CharSequence}} parameters to strings, enabling full code point support. The docs could be updated to make this crystal clear.
> There is a question of whether this breaks backwards compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
