commons-dev mailing list archives

From Henri Yandell <flame...@gmail.com>
Subject Re: [LANG] New class called StringAlgorithms?
Date Wed, 22 Jan 2014 08:34:09 GMT
On Mon, Jan 20, 2014 at 8:01 AM, Benedikt Ritter <britter@apache.org> wrote:

> 2014/1/18 Oliver Heger <oliver.heger@oliver-heger.de>
>
> >
> >
> > Am 18.01.2014 17:40, schrieb Emmanuel Bourg:
> > > Le 18/01/2014 16:04, Benedikt Ritter a écrit :
> > >
> > >> About putting this into codec: I still don't think this is a good fit
> > >> for this contribution. Codec is about, well, decoding and encoding stuff.
> > >> Jaro Winkler and Levenshtein Distance are more like scores or metrics
> > >> that help in comparing strings.
> > >
> > > The point is, string metrics and soundex algorithm are often used to
> > > find similarities between words. That's a bit odd to have them in
> > > separate packages. That being said, string metrics doesn't look like a
> > > good fit for codec since it doesn't encode anything.
> >
> > From a logic PoV I agree with Emmanuel that a separate Text component
> > would make sense. It could also contain other stuff like special search
> > algorithms or trie implementations.
> >
> > From an organizational PoV I also understand Gary: It is unlikely that
> > we have the energy and manpower to keep such a new component alive -
> > unless someone steps up now?
> >
> > So I am on the fence. In the past we have always tried to keep [lang] very
> > focused and lean.
> >
>
> Well these string distance metrics could be seen as an addition to
> java.lang.String. In this regard a StringDistanceMetrics class would fit
> into [lang].


I don't recall why we moved things like Soundex and Metaphone from Lang to
Codec but not Levenshtein. There was lots of debate and I'm guessing it was
because of the API not being transformative on the input but instead
comparative. I think that still holds.

My thinking - keep it simple for 3.3, figure out bigger picture for 4.0 if
simple was too simple.

What I'm tempted to think about is splitting up StringUtils in 4.0. Make it
more manageable and easier to find methods in. At 188 methods I think this
is worth considering.

I would be tempted by "StringCompare.getLevenshteinDistance(...)".
countMatches(String, String) would join it. Maybe all the
startsWith/endsWith methods. Thinking out loud. Premature though for 3.3 :)
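
To make the idea concrete, here is a rough sketch of what such a class might
look like. StringCompare is hypothetical and does not exist in [lang]; the
method bodies are plain textbook implementations written for illustration,
not the current StringUtils code. The point is only the shape: every method
takes two inputs and returns a comparison result rather than a transformed
string.

```java
// Hypothetical sketch: StringCompare is NOT an existing Commons Lang class.
final class StringCompare {

    private StringCompare() {
        // static utility class, no instances
    }

    /** Levenshtein edit distance, computed with two rolling rows. */
    static int getLevenshteinDistance(CharSequence s, CharSequence t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }
        int[] previous = new int[t.length() + 1];
        int[] current = new int[t.length() + 1];
        // Distance from the empty prefix of s to each prefix of t.
        for (int j = 0; j <= t.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(current[j - 1] + 1, previous[j] + 1);
                current[j] = Math.min(insertOrDelete, previous[j - 1] + cost);
            }
            // Reuse the two arrays instead of allocating a new row.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[t.length()];
    }

    /** Counts non-overlapping occurrences of sub in str; null or empty input gives 0. */
    static int countMatches(String str, String sub) {
        if (str == null || sub == null || str.isEmpty() || sub.isEmpty()) {
            return 0;
        }
        int count = 0;
        int idx = 0;
        while ((idx = str.indexOf(sub, idx)) != -1) {
            count++;
            idx += sub.length();
        }
        return count;
    }
}
```

With this sketch, StringCompare.getLevenshteinDistance("kitten", "sitting")
gives 3 and StringCompare.countMatches("abcabc", "bc") gives 2.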

For now I'm in favour of putting jaroWinkler in StringUtils and putting off
the bigger question of StringUtils being so big. Removing the two
Levenshtein methods would only take it from 188 to 186 methods - no real
impact to anybody.

Hen
