commons-dev mailing list archives

From "Bruno P. Kinoshita" <brunodepau...@yahoo.com.br.INVALID>
Subject [text] Re: CharSequence vs. String (was Re: [GitHub] commons-text pull request #46: TEXT-85:Added CaseUtils class with camel case...)
Date Wed, 21 Jun 2017 11:07:06 GMT
>If a method doesn't intrinsically require a String, then I prefer
>CharSequence. It's probable that sooner or later something is going to
>demand a String, but that's not a good reason to be "that guy" :-)
I lean towards using CharSequence when that makes sense too (i.e. when it signals that we are
working on code points, and when we want to support other CharSequence implementations). The
tdebatty/java-string-similarity library works only with Strings, I think. Others like LingPipe,
ICU4J, Lucene, Apache Commons Text, and Apache OpenNLP use both CharSequence and String.
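
A small illustration of why CharSequence fits code-point-oriented methods: CharSequence has
had a default codePoints() method since Java 8, so a single method can serve String,
StringBuilder, and other implementations. The class and method names below are hypothetical
and not part of [text]; this is only a sketch.

```java
final class CodePointDemo {
    // Hypothetical helper: counts letters by code point, so supplementary
    // characters (surrogate pairs) are counted once, not twice.
    static long countLetters(CharSequence text) {
        return text.codePoints().filter(Character::isLetter).count();
    }

    public static void main(String[] args) {
        // Works for a String...
        System.out.println(countLetters("TEXT-85"));                    // 4
        // ...and for a StringBuilder, without calling toString() first.
        System.out.println(countLetters(new StringBuilder("a1b")));     // 2
    }
}
```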

Analysing the use of CharSequence and String could be an interesting idea for a blog post,
and could even raise some tickets to fix consistency in the API of [text] or some other component/project.

>Also, wouldn't some sort of low-space-overhead string storage be a good fit for text?

Sounds interesting. Normally when I have an idea like that for [text] (or for other
projects/components), I first note it down somewhere (normally at http://kinoshita.eti.br/todo/),
and then file an issue like TEXT-71, TEXT-77, TEXT-78, or TEXT-79 to start investigating it.

If you have some idea of how that could be implemented, or know about some projects for that,
feel free to suggest it in a JIRA ticket, or start another thread here on the mailing list.

Cheers
Bruno

________________________________

From: Simon Spero <sesuncedu@gmail.com>
To: Commons Developers List <dev@commons.apache.org> 
Sent: Tuesday, 20 June 2017 1:39 AM
Subject: CharSequence vs. String (was Re: [GitHub] commons-text pull request #46: TEXT-85:Added
CaseUtils class with camel case...)



On Jun 12, 2017 10:47 AM, "arunvinudss" <git@git.apache.org> wrote:


Github user arunvinudss commented on a diff in the pull request:


    I am a bit biased towards using String instead of CharSequence. Yes,
    CharSequence allows us to pass StringBuffers, StringBuilders, and other
    types as input, potentially increasing the scope of the function, but
    considering the nature of the work we do in this particular method it may
    not necessarily be a good idea. My basic contention is that the minute we
    call toString() on a CharSequence to do any sort of manipulation it becomes
    a costly operation and we may lose performance.



True if the particular CharSequence is not in fact an instance of String.
String::toString returns this.
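
A quick way to see the difference, as a minimal sketch (the class and helper names are
made up for illustration): String.toString() hands back the receiver itself, while
StringBuilder.toString() builds a new String on every call.

```java
final class ToStringCost {
    // Hypothetical helper: obtains a String from any CharSequence.
    static String asString(CharSequence cs) {
        return cs.toString();
    }

    public static void main(String[] args) {
        String s = "camelCase";
        StringBuilder sb = new StringBuilder("camelCase");

        // String.toString() returns this: no copy, no allocation.
        System.out.println(asString(s) == s);        // true

        // StringBuilder.toString() creates a fresh String each time.
        String copy1 = asString(sb);
        String copy2 = asString(sb);
        System.out.println(copy1 == copy2);          // false
        System.out.println(copy1.equals(copy2));     // true
    }
}
```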


The bigger problem is that too many methods use String as a parameter or
return type, when CharSequence would serve just as well. Calling them with
any other CharSequence does indeed require invoking Object::toString.


For methods that use String as the return type, changing the result to
CharSequence is source and binary incompatible, and properly so (since at
some point the user may actually need a String).


A generic method with a type parameter bounded by CharSequence (T extends
CharSequence) can sometimes be useful, and can be added in addition to
methods taking String arguments, but can't replace them.
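
As a sketch of that idea (hypothetical names, not anything in [text]): the String method
stays, so existing callers keep compiling and linking, and a bounded generic overload is
added next to it. The generic version returns the caller's own type, so a StringBuilder
goes in and a StringBuilder comes out.

```java
final class BoundedOverloads {
    // Existing String-based method: left untouched for compatibility.
    static String longest(String a, String b) {
        return a.length() >= b.length() ? a : b;
    }

    // Added overload: accepts any CharSequence type and preserves it in the result.
    static <T extends CharSequence> T longest(T a, T b) {
        return a.length() >= b.length() ? a : b;
    }

    public static void main(String[] args) {
        // String arguments resolve to the more specific String overload.
        String s = longest("a", "bb");
        System.out.println(s);                       // bb

        // StringBuilder arguments use the generic overload; no toString() needed.
        StringBuilder x = new StringBuilder("abc");
        StringBuilder y = new StringBuilder("de");
        StringBuilder r = longest(x, y);
        System.out.println(r == x);                  // true
    }
}
```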


There are some places in javac that have special treatment for String, for
example the + operator, but JDK 9 reduces that particular win by indyfying
concat (string concatenation is compiled to an invokedynamic call).

If a method doesn't intrinsically require a String, then I prefer
CharSequence. It's probable that sooner or later something is going to
demand a String, but that's not a good reason to be "that guy" :-)


Note:

Strings can be an incredible waste of memory: 40 + 8·⌈length/4⌉ bytes
(reduced to a mere 40 + 8·⌈length/8⌉ bytes in JDK 9 when compact strings can
be used).
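
To make the per-string overhead concrete, here is a rough estimator. It is only a sketch
under stated assumptions: a 64-bit HotSpot JVM with compressed oops and 8-byte object
alignment, a 24-byte String object, and a 16-byte array header. The class and method names
are made up for illustration; real footprints are better measured with a tool such as JOL.

```java
final class StringFootprint {
    // Objects are assumed to be padded to a multiple of 8 bytes.
    static long roundUpTo8(long n) {
        return (n + 7) / 8 * 8;
    }

    // Assumed JDK 8 layout: String object (24) + char[] (16-byte header + 2 bytes per char).
    static long jdk8Bytes(int length) {
        return 24 + roundUpTo8(16 + 2L * length);
    }

    // Assumed JDK 9 compact (Latin-1) layout: String object (24) + byte[] (16 + 1 byte per char).
    static long jdk9Latin1Bytes(int length) {
        return 24 + roundUpTo8(16 + (long) length);
    }

    public static void main(String[] args) {
        System.out.println(jdk8Bytes(0));            // 40
        System.out.println(jdk8Bytes(10));           // 64
        System.out.println(jdk9Latin1Bytes(10));     // 56
    }
}
```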


This is incredibly painful if you have a vast number of small "strings",
which may not all need to be materialized simultaneously. See e.g. [1]
(~50MiB of UTF-8 chars becomes ~250MiB of Strings. And since there's no
individual humongous object, they all get to make the journey from TLAB to
Old Space the hard way. Note this predates JDK 9, but illustrates some of
the win from compact strings).


Storing the character data in a shared byte array is a huge win. Someone
should tell the jdk implementors to look at applications that do this.
Like, um, javac :-)


Materializing these strings as possibly transient CharSequences is
really convenient... until some method just has to have a String.
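
A minimal sketch of the shared-array idea, assuming one byte per char (ISO-8859-1) to keep
it short; real data in UTF-8 would need decoding. Latin1Pool and its methods are
hypothetical names. Each slice is a small transient view over the shared array, and a
String is only allocated when toString() is finally called.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Pool {
    private final byte[] data;   // shared storage for many small "strings"

    Latin1Pool(byte[] data) {
        this.data = data;
    }

    // Returns a lightweight view of data[start, end); no character data is copied.
    CharSequence slice(int start, int end) {
        if (start < 0 || end > data.length || start > end) {
            throw new IndexOutOfBoundsException();
        }
        return new Slice(start, end);
    }

    private final class Slice implements CharSequence {
        private final int start;
        private final int end;

        Slice(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override public int length() {
            return end - start;
        }

        @Override public char charAt(int index) {
            if (index < 0 || index >= length()) {
                throw new IndexOutOfBoundsException();
            }
            return (char) (data[start + index] & 0xFF);   // Latin-1: byte value == char value
        }

        @Override public CharSequence subSequence(int from, int to) {
            if (from < 0 || to > length() || from > to) {
                throw new IndexOutOfBoundsException();
            }
            return new Slice(start + from, start + to);
        }

        // The only place a String (and a copy of the bytes) is materialized.
        @Override public String toString() {
            return new String(data, start, end - start, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "owl:Classrdfs:label".getBytes(StandardCharsets.ISO_8859_1);
        Latin1Pool pool = new Latin1Pool(bytes);
        CharSequence first = pool.slice(0, 9);
        CharSequence second = pool.slice(9, 19);
        System.out.println(first.length());          // 9
        System.out.println(first);                   // owl:Class
        System.out.println(second.subSequence(5, 10)); // label
    }
}
```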


Also, wouldn't some sort of low-space-overhead string storage be a good
fit for text?


Simon

[1] Spero, S. (2015). Time And Relative Dimensions In Semantics: Is OWL
Bigger On The Inside? OWLED 2015. Available at
http://cgi.csc.liv.ac.uk/~valli/OWLED2015/OWLED_2015_paper_12.pdf


