commons-dev mailing list archives

From "Bruno P. Kinoshita" <brunodepau...@yahoo.com.br>
Subject Re: [TEXT] Distance vs. Metric vs. Similarity
Date Sun, 14 Dec 2014 22:10:31 GMT
> Sounds good, although I'm not sure I understand where you are going with
> the marker interface. What is its purpose?

Let's then keep the StringMetric interface and update its Javadoc. Thinking
again, that other marker interface seems to be unnecessary.

> Okay, but we need to make sure all algorithms really return a distance
> then. As I said, FuzzyDistance currently really returns a similarity score.
> An algorithm returning a distance should return a higher number for higher
> distances.

I had a look at the code, and I think I understand what you are saying now.
In FuzzyDistance, the higher the score, the closer the strings are. That is
different from what the other algorithms return.

I believe I found why I named that package similarity. Probably it was
because I saw that in the stringmetric library [1]. There, Levenshtein,
Jaccard and other algorithms are suffixed with "Metric".

How about we keep the package as similarity and simply rename the classes to
[Algo]Metric too? This way we will be able to accommodate other metrics such
as the Sorensen-Dice coefficient, where the higher the coefficient, the more
similar two strings are.

WDYT?

Cheers
Bruno
[1] https://github.com/rockymadden/stringmetric
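
To make the [Algo]Metric naming concrete, here is a minimal sketch of how a
similarity-style metric such as the Sorensen-Dice coefficient could sit behind
a generic metric interface. The StringMetric<R> interface below is a
hypothetical stand-in declared only for this example, and SorensenDiceMetric
is an illustrative name, not an existing [text] class. The coefficient is
computed over character bigrams, which is one common choice and an assumption
here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the StringMetric interface discussed in this thread.
interface StringMetric<R> {
    R compare(CharSequence left, CharSequence right);
}

// Sorensen-Dice coefficient over character bigrams: 2 * |shared| / (|A| + |B|).
// The result lies in [0, 1]; a higher value means the strings are more similar.
final class SorensenDiceMetric implements StringMetric<Double> {

    @Override
    public Double compare(CharSequence left, CharSequence right) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }
        List<String> a = bigrams(left);
        List<String> b = bigrams(right);
        if (a.isEmpty() && b.isEmpty()) {
            // Neither input has a bigram: fall back to plain equality.
            return left.toString().contentEquals(right) ? 1.0 : 0.0;
        }
        int total = a.size() + b.size();
        int shared = 0;
        for (String gram : a) {
            // Remove each matched bigram so duplicates are counted once per occurrence.
            if (b.remove(gram)) {
                shared++;
            }
        }
        return 2.0 * shared / total;
    }

    // Collects all overlapping two-character substrings of s.
    private static List<String> bigrams(CharSequence s) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            grams.add(s.subSequence(i, i + 2).toString());
        }
        return grams;
    }
}
```

With this sketch, new SorensenDiceMetric().compare("night", "nacht") shares
one bigram ("ht") out of eight in total, so identical strings score highest
and unrelated strings score lowest, the opposite direction of an edit
distance.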
 


From: Benedikt Ritter <britter@apache.org>
To: Commons Developers List <dev@commons.apache.org>; Bruno P. Kinoshita <brunodepaulak@yahoo.com.br>
Sent: Sunday, December 14, 2014 6:45 PM
Subject: Re: [TEXT] Distance vs. Metric vs. Similarity

Hi Bruno,



2014-12-14 21:37 GMT+01:00 Bruno P. Kinoshita <brunodepaulak@yahoo.com.br>:
>
> Hello Benedikt!
> > Metric feels like it's something more general, but I'm not sure.
> You're right. Metric was supposed to be a general interface,
> representing the String Metric from the Wikipedia article.
> >  and the interface from StringMetric to StringDistance.
> I'm reading the Myers paper, and already have a local branch with the
> Myers algorithm from [collections] ported to [text].
> Perhaps we could move the StringMetric interface to o.a.c.text package,
> and create StringDistance or EditDistance interface in o.a.c.text.distance.
> This way we can have String Metrics as in Wikipedia, as being a way of
> giving a value for comparing two strings. We would have the edit distances
> in the distance package, and the diff algorithms in another diff package.
> All of them being String Metrics.
> What do you think?
>

Sounds good, although I'm not sure I understand where you are going with
the marker interface. What is its purpose?


> > > I think we should consider renaming everything to distance, since the
> > > implemented algorithms all end on *Distance. So we would change the
> > > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > > interface from StringMetric to StringDistance.
> >
> > Looking at the code again, it seems like the algorithms all really
> > return a similarity score and not a distance. For example, FuzzyDistance
> > JavaDoc states: "A higher score indicates a higher similarity". If this is
> > the case, maybe it makes more sense to rename everything to Similarity?
>
> I'm in favor of dropping score and similarity, and adopting distance in
> the package, classes and javadocs, as it is used in other tools (e.g. Solr,
> Talend, Informatica IIR, etc).
>

Okay, but we need to make sure all algorithms really return a distance
then. As I said, FuzzyDistance currently really returns a similarity score.
An algorithm returning a distance should return a higher number for higher
distances.
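
The difference in direction can be illustrated with a small sketch. The class
and method names below are made up for this example and are not the [text]
implementation: levenshtein returns a true distance (0 for equal strings,
larger as they diverge), while similarity shows one possible way to normalize
that distance into a score where higher means closer.

```java
// Illustrative only: names are hypothetical, not the [text] API.
final class DistanceVsSimilarity {

    // Classic Levenshtein edit distance using two rolling rows.
    // Returns 0 for equal strings and grows as the strings diverge.
    static int levenshtein(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of inserting the first j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // One possible normalization into [0, 1]: higher means more similar.
    static double similarity(CharSequence a, CharSequence b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "kitten"));  // 0
        System.out.println(levenshtein("kitten", "sitting")); // 3
        System.out.println(similarity("kitten", "kitten"));   // 1.0
    }
}
```

A class named *Distance would be expected to behave like levenshtein here,
whereas a score that rises with closeness behaves like similarity.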

Benedikt


> All the best,
> Bruno
>
>
>      From: Benedikt Ritter <britter@apache.org>
>  To: Commons Developers List <dev@commons.apache.org>
>  Sent: Sunday, December 14, 2014 6:20 PM
>  Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
>
> 2014-12-14 21:08 GMT+01:00 Benedikt Ritter <britter@apache.org>:
> >
> > Hi,
> >
> > currently the wording in commons text is a bit confusing. We have the
> > three terms:
> >
> > - distance
> > - similarity
> > - metric
> >
> > Distance and similarity seem to be just opposites of the same thing. A
> > great distance indicates a small similarity between two character
> > sequences. Metric feels like it's something more general, but I'm not
> > sure.
> >
> > I think we should consider renaming everything to distance, since the
> > implemented algorithms all end on *Distance. So we would change the
> > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > interface from StringMetric to StringDistance.
> >
>
> Looking at the code again, it seems like the algorithms all really return a
> similarity score and not a distance. For example, FuzzyDistance JavaDoc
> states: "A higher score indicates a higher similarity". If this is the case,
> maybe it makes more sense to rename everything to Similarity?
>
>
> >
> > WDYT?
> >
> > Benedikt
> >
> > --
> > http://people.apache.org/~britter/
> > http://www.systemoutprintln.de/
> > http://twitter.com/BenediktRitter
> > http://github.com/britter



-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter


   
