commons-dev mailing list archives

From Benedikt Ritter <brit...@apache.org>
Subject Re: [TEXT] Distance vs. Metric vs. Similarity
Date Thu, 18 Dec 2014 13:35:46 GMT
2014-12-14 23:10 GMT+01:00 Bruno P. Kinoshita <brunodepaulak@yahoo.com.br>:
>
> > Sounds good, although I'm not sure I understand where you are going with
> > the marker interface. What is its purpose?
>
> Let's then keep the StringMetric interface and update its Javadoc.
> Thinking again, that other marker interface seems to be unnecessary.
>
> > Okay, but we need to make sure all algorithms really return a distance
> > then. As I said, FuzzyDistance currently really returns a similarity score.
> > An algorithm returning a distance should return a higher number for higher
> > distances.
>
> I had a look at the code, and I think I understand what you are saying now.
> In FuzzyDistance, the higher the score, the closer the strings are. That is
> different from what the other algorithms return.
>
> I believe I found why I named that package similarity. Probably it was
> because I saw that in the stringmetric library [1]. There, Levenshtein,
> Jaccard and other algorithms are suffixed with "Metric".
>
> How about we keep the package as similarity and simply rename the classes
> to [Algo]Metric too? This way we will be able to accommodate other metrics
> such as the Sorensen-Dice coefficient, where the higher the coefficient,
> the more similar two strings are.
>
> WDYT?
>


Hey Bruno,

Yes, we can do it that way. What I want to avoid is that users have to
check the Javadoc every time they use an algorithm. To me it would make
sense to have a number of distance algorithms that all return a
distance, or a number of similarity algorithms that all return a
similarity. That way users can swap out the underlying algorithm without
changing their code.
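
A minimal Java sketch of that idea, under stated assumptions: StringDistance
is the interface name proposed in this thread, while LevenshteinSketch,
HammingSketch and ClosestMatch are hypothetical names made up for this
illustration. None of this is the actual [text] API. The only contract is
that a larger return value always means the inputs are further apart.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical contract: a larger result always means "further apart". */
interface StringDistance {
    int distance(CharSequence left, CharSequence right);
}

/** Illustrative Levenshtein edit distance (insert, delete, substitute). */
final class LevenshteinSketch implements StringDistance {
    @Override
    public int distance(CharSequence left, CharSequence right) {
        int[] previous = new int[right.length() + 1];
        int[] current = new int[right.length() + 1];
        for (int j = 0; j <= right.length(); j++) {
            previous[j] = j; // cost of building right[0..j) from an empty prefix
        }
        for (int i = 1; i <= left.length(); i++) {
            current[0] = i; // cost of deleting the first i characters of left
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            int[] swap = previous; // reuse the two row buffers
            previous = current;
            current = swap;
        }
        return previous[right.length()];
    }
}

/** Illustrative Hamming distance: differing positions, equal lengths only. */
final class HammingSketch implements StringDistance {
    @Override
    public int distance(CharSequence left, CharSequence right) {
        if (left.length() != right.length()) {
            throw new IllegalArgumentException("inputs must have equal length");
        }
        int differences = 0;
        for (int i = 0; i < left.length(); i++) {
            if (left.charAt(i) != right.charAt(i)) {
                differences++;
            }
        }
        return differences;
    }
}

/** Caller code that relies only on "smaller distance means closer". */
final class ClosestMatch {
    static String closest(StringDistance algorithm, String query, List<String> candidates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int d = algorithm.distance(query, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("kathrin", "kerstin", "karolyn");
        // Swapping the algorithm does not require touching closest().
        System.out.println(closest(new LevenshteinSketch(), "karolin", names));
        System.out.println(closest(new HammingSketch(), "karolin", names));
    }
}
```

If one implementation returned a similarity score instead (higher means
closer, as FuzzyDistance does today), closest() would silently pick the
worst candidate, which is exactly the Javadoc-checking problem described
above.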

Benedikt


> Cheers,
> Bruno
>
> [1] https://github.com/rockymadden/stringmetric
>
>
>
>       From: Benedikt Ritter <britter@apache.org>
>  To: Commons Developers List <dev@commons.apache.org>; Bruno P. Kinoshita
> <brunodepaulak@yahoo.com.br>
>  Sent: Sunday, December 14, 2014 6:45 PM
>  Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
>
> Hi Bruno,
>
>
>
> 2014-12-14 21:37 GMT+01:00 Bruno P. Kinoshita <brunodepaulak@yahoo.com.br>:
> >
> > Hello Benedikt!
> >
> > > Metric feels like it's something more general, but I'm not sure.
> >
> > You're right. Metric was supposed to be a general interface,
> > representing the String Metric from the Wikipedia article.
> >
> > > and the interface from StringMetric to StringDistance.
> >
> > I'm reading the Myers paper, and already have a local branch with the
> > Myers algorithm from [collections] ported to [text].
> >
> > Perhaps we could move the StringMetric interface to the o.a.c.text
> > package, and create a StringDistance or EditDistance interface in
> > o.a.c.text.distance. This way we can have String Metrics as in
> > Wikipedia, as being a way of giving a value for comparing two strings.
> > We would have the edit distances in the distance package, and the diff
> > algorithms in another diff package. All of them being String Metrics.
> >
> > What do you think?
> >
>
> Sounds good, although I'm not sure I understand where you are going with
> the marker interface. What is its purpose?
>
>
> > > > I think we should consider renaming everything to distance, since
> > > > the implemented algorithms all end on *Distance. So we would change
> > > > the package name from o.a.c.text.similarity to o.a.c.text.distance
> > > > and the interface from StringMetric to StringDistance.
> > >
> > > Looking at the code again, it seems like the algorithms all really
> > > return a similarity score and not a distance. For example, the
> > > FuzzyDistance JavaDoc states: "A higher score indicates a higher
> > > similarity". If this is the case, maybe it makes more sense to rename
> > > everything to Similarity?
> >
> > I'm in favor of dropping score and similarity, and adopting distance in
> > the package, classes and javadocs, as it is used in other tools (e.g.
> > Solr, Talend, Informatica IIR, etc).
> >
> >
>
> Okay, but we need to make sure all algorithms really return a distance
> then. As I said, FuzzyDistance currently really returns a similarity score.
> An algorithm returning a distance should return a higher number for higher
> distances.
>
> Benedikt
>
>
> > All the best,
> > Bruno
> >
> >
> >      From: Benedikt Ritter <britter@apache.org>
> >  To: Commons Developers List <dev@commons.apache.org>
> >  Sent: Sunday, December 14, 2014 6:20 PM
> >  Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
> >
> > 2014-12-14 21:08 GMT+01:00 Benedikt Ritter <britter@apache.org>:
> > >
> > > Hi,
> > >
> > > currently the wording in commons text is a bit confusing. We have the
> > > three terms:
> > >
> > > - distance
> > > - similarity
> > > - metric
> > >
> > > Distance and similarity seem to be just opposites of the same thing. A
> > > great distance indicates a small similarity between two character
> > > sequences. Metric feels like it's something more general, but I'm not
> > > sure.
> > >
> > > I think we should consider renaming everything to distance, since the
> > > implemented algorithms all end on *Distance. So we would change the
> > > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > > interface from StringMetric to StringDistance.
> > >
> >
> > Looking at the code again, it seems like the algorithms all really
> > return a similarity score and not a distance. For example, the
> > FuzzyDistance JavaDoc states: "A higher score indicates a higher
> > similarity". If this is the case, maybe it makes more sense to rename
> > everything to Similarity?
> >
> >
> > >
> > > WDYT?
> > >
> > > Benedikt
> > >
> > > --
> > > http://people.apache.org/~britter/
> > > http://www.systemoutprintln.de/
> > > http://twitter.com/BenediktRitter
> > > http://github.com/britter

-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter
