commons-dev mailing list archives

From "Bruno P. Kinoshita" <brunodepau...@yahoo.com.br>
Subject Re: [TEXT] Distance vs. Metric vs. Similarity
Date Sun, 14 Dec 2014 20:37:34 GMT
Hello Benedikt!
> Metric feels like it's something more general, but I'm not sure.
You're right. Metric was supposed to be a general interface, representing the String Metric
from the Wikipedia article.
>  and the interface from StringMetric to StringDistance.
I'm reading the Myers paper, and already have a local branch with the Myers algorithm from
[collections] ported to [text]. 
Perhaps we could move the StringMetric interface to the o.a.c.text package, and create a
StringDistance or EditDistance interface in o.a.c.text.distance.
This way we can have String Metrics as in Wikipedia, i.e. a way of giving a value for comparing
two strings. We would have the edit distances in the distance package, and the diff algorithms
in another diff package, all of them being String Metrics.
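To make the proposed layout a bit more concrete, here is a rough sketch. It is only an illustration under assumptions: the generic `compare` signature, the `EditDistance` name and the `HammingDistance` example class are hypothetical here, and the intended packages are noted in comments rather than declared so the snippet compiles as a single file.

```java
import java.util.Objects;

// Hypothetical sketch of the proposed layout; not released [text] API.

/** Would live in o.a.c.text: the general notion of a value comparing two strings. */
interface StringMetric<R> {
    R compare(CharSequence left, CharSequence right);
}

/** Would live in o.a.c.text.distance: 0 means equal, larger means further apart. */
interface EditDistance extends StringMetric<Integer> {
}

/** Illustrative edit distance: counts differing positions of equal-length inputs. */
final class HammingDistance implements EditDistance {
    @Override
    public Integer compare(CharSequence left, CharSequence right) {
        Objects.requireNonNull(left, "left");
        Objects.requireNonNull(right, "right");
        if (left.length() != right.length()) {
            throw new IllegalArgumentException("Inputs must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < left.length(); i++) {
            if (left.charAt(i) != right.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }
}
```

With this split, diff algorithms could implement StringMetric with a different result type in their own package, without being forced into the distance contract.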
What do you think?
> > I think we should consider renaming everything to distance, since the
> > implemented algorithms all end on *Distance. So we would change the package
> > name from o.a.c.text.similarity to o.a.c.text.distance and the interface
> > from StringMetric to StringDistance.
> >
> Looking at the code again, it seems like the algorithms all really return a
> similarity score and not a distance. For example, the FuzzyDistance JavaDoc
> states: "A higher score indicates a higher similarity". If this is the case,
> maybe it makes more sense to rename everything to Similarity?
I'm in favor of dropping score and similarity, and adopting distance in the package, classes
and javadocs, as it is used in other tools (e.g. Solr, Talend, Informatica IIR, etc).
All the best,
Bruno

 
From: Benedikt Ritter <britter@apache.org>
To: Commons Developers List <dev@commons.apache.org>
Sent: Sunday, December 14, 2014 6:20 PM
Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
   
2014-12-14 21:08 GMT+01:00 Benedikt Ritter <britter@apache.org>:
>
> Hi,
>
> currently the wording in commons text is a bit confusing. We have the
> three terms:
>
> - distance
> - similarity
> - metric
>
> Distance and similarity seem to be just opposites of the same thing. A
> great distance indicates a small similarity between two character
> sequences. Metric feels like it's something more general, but I'm not sure.
>
> I think we should consider renaming everything to distance, since the
> implemented algorithms all end on *Distance. So we would change the package
> name from o.a.c.text.similarity to o.a.c.text.distance and the interface
> from StringMetric to StringDistance.
>

Looking at the code again, it seems like the algorithms all really return a
similarity score and not a distance. For example, the FuzzyDistance JavaDoc
states: "A higher score indicates a higher similarity". If this is the case,
maybe it makes more sense to rename everything to Similarity?
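
To illustrate the difference between the two readings, here is a minimal sketch. It is not code from [text]: the class and method names are hypothetical, and normalizing by the longer input length is just one common convention, assumed here for illustration. A distance is 0 for equal inputs and grows as they diverge, while the derived similarity is 1.0 for equal inputs and shrinks.

```java
// Hypothetical helper, only to contrast "distance" with "similarity".
final class DistanceVersusSimilarity {

    /** Classic Levenshtein edit distance using two rolling rows. */
    static int levenshtein(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1], assuming normalization by the longer input. */
    static double similarity(CharSequence a, CharSequence b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty inputs are treated as identical
        }
        return 1.0 - (double) levenshtein(a, b) / maxLen;
    }
}
```

Whichever name is chosen, the JavaDoc should state which direction "higher" means, since the two conventions are inverted.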


>
> WDYT?
>
> Benedikt
>
> --
> http://people.apache.org/~britter/
> http://www.systemoutprintln.de/
> http://twitter.com/BenediktRitter
> http://github.com/britter


>


-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter


   
