commons-dev mailing list archives

From Gary Gregory <garydgreg...@gmail.com>
Subject Re: [LANG] New class called StringAlgorithms?
Date Wed, 22 Jan 2014 12:20:29 GMT
This all sounds reasonable. 

G

-------- Original message --------
From: Benedikt Ritter <beneritter@gmail.com> 
Date:01/22/2014  05:15  (GMT-05:00) 
To: Commons Developers List <dev@commons.apache.org> 
Subject: Re: [LANG] New class called StringAlgorithms? 

Hello,

2014/1/22 Henri Yandell <flamefew@gmail.com>

>  On Mon, Jan 20, 2014 at 8:01 AM, Benedikt Ritter <britter@apache.org>
> wrote:
>
> > 2014/1/18 Oliver Heger <oliver.heger@oliver-heger.de>
> >
> > >
> > >
> > > Am 18.01.2014 17:40, schrieb Emmanuel Bourg:
> > > > Le 18/01/2014 16:04, Benedikt Ritter a écrit :
> > > >
> > > >> About putting this into codec: I still don't think this is a good fit for
> > > >> this contribution. Codec is about, well decoding and encoding stuff. Jaro
> > > >> Winkler and Levenshtein Distance are more like scores or metrics that help
> > > >> in comparing strings.
> > > >
> > > > The point is, string metrics and soundex algorithm are often used to
> > > > find similarities between words. That's a bit odd to have them in
> > > > separate packages. That being said, string metrics doesn't look like a
> > > > good fit for codec since it doesn't encode anything.
> > >
> > > From a logical PoV I agree with Emmanuel that a separate Text component
> > > would make sense. It could also contain other stuff like special search
> > > algorithms or trie implementations.
> > >
> > > From an organizational PoV I also understand Gary: It is unlikely that
> > > we have the energy and manpower to keep such a new component alive,
> > > unless someone steps up now?
> > >
> > > So I am on the fence. In the past we have always tried to keep [lang] very
> > > focused and lean.
> > >
> >
> > Well these string distance metrics could be seen as an addition to
> > java.lang.String. In this regard a StringDistanceMetrics class would fit
> > into [lang].
>
>
> I don't recall why we sent things like Soundex and Metaphone from Lang to
> Codec but not Levenshtein. There was lots of debate and I'm guessing it was
> because of the API not being transformative on the input but instead
> comparative. I think that still holds.
>

Makes sense
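
To make the transformative/comparative distinction concrete, here is a
minimal Java sketch. It is only an illustration, not the [lang] or [codec]
code: the class name StringShapesSketch and the toyEncode method are
hypothetical, and toyEncode is deliberately not Soundex. The levenshtein
method is a plain two-row dynamic-programming version of the distance.

```java
import java.util.Locale;

// Hypothetical sketch: contrasts the two API shapes discussed in this thread.
final class StringShapesSketch {

    // Transformative shape: one input string in, one encoded string out.
    // Toy stand-in for an encoder (NOT Soundex): upper-cases the input and
    // drops the vowels A, E, I, O, U after the first character.
    static String toyEncode(String s) {
        if (s.isEmpty()) {
            return s;
        }
        String upper = s.toUpperCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder().append(upper.charAt(0));
        for (int i = 1; i < upper.length(); i++) {
            char c = upper.charAt(i);
            if ("AEIOU".indexOf(c) < 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Comparative shape: two input strings in, one score out.
    // Levenshtein distance = minimum number of single-character
    // insertions, deletions and substitutions turning a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(toyEncode("Robert"));
        System.out.println(levenshtein("kitten", "sitting"));
    }
}
```

The first method returns a new representation of its input, which is the
kind of thing that fits an encoder. The second only scores a pair of
strings, which is why it reads more like a String helper.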


>
> My thinking - keep it simple for 3.3, figure out bigger picture for 4.0 if
> simple was too simple.
>
> What I'm tempted to think about is splitting up StringUtils in 4.0. Make it
> more manageable and easier to find methods in. At 188 methods I think this
> is worth considering.
>

Makes sense.


>
> I would be tempted by "StringCompare.getLevenshteinDistance(...)".
> countMatches(String, String) would join them. Maybe all the
> startsWith/endsWith methods. Thinking out loud. Premature though for 3.3 :)
>
> For now I'm in favour of putting jaroWinkler in StringUtils and putting off
> the bigger question of StringUtils being so big. Removing the two
> Levenshtein methods will see a change of 188 to 186 methods - no real impact
> to anybody.
>

Yes. I'd prefer this solution, since I want to give contributors the
feeling that their contributions end up in trunk and are ready for use in
upcoming releases. If you contribute stuff that never ends up in a release,
that will frustrate you. So keeping things simple and figuring out the big
picture for 4.0 is a good idea.

Benedikt


>
> Hen
>