commons-dev mailing list archives

From "Bruno P. Kinoshita" <brunodepau...@yahoo.com.br>
Subject Re: [TEXT] Distance vs. Metric vs. Similarity
Date Thu, 16 Apr 2015 11:38:27 GMT
Hi Benedikt

> Very nice! Maybe we can even come up with a generic class that calculates a
> distance based on a similarity score.

Hmmm, that's a good idea. We probably want to keep that idea in an issue for later :-) [1]
I'll use my next development cycle on [text] to review the code and reports, and to write
the user guide with what we have already in the project.
Do you think we would need anything else before trying a 1.0 release? There are two TODO marks
in the tests, but I plan to get rid of them in the next few days too. They don't seem like
a blocker right now anyway.

Thanks
Bruno

[1] https://issues.apache.org/jira/browse/SANDBOX-495

 
From: Benedikt Ritter <britter@apache.org>
To: Commons Developers List <dev@commons.apache.org>
Sent: Wednesday, April 15, 2015 11:03 PM
Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
   
Hi Bruno

2015-04-15 12:14 GMT+02:00 Bruno P. Kinoshita <brunodepaulak@yahoo.com.br>:

> Hi Benedikt,
>
> After playing more with [text] and some edit distances, I think we can
> retake this conversation and hopefully fix SANDBOX-488 [1].
>
> I've created a branch SANDBOX-488 in git [2] with the following
> modifications:
>
> * The StringMetric interface has been renamed to EditDistance
> * We have the following edit distances available: Levenshtein,
> JaroWinkler, Hamming ([lang]) and Cosine. Others might be added in the
> future, such as Jaccard and QGram
> * When an edit distance returns 0, it means both strings are identical or
> at least very similar. Conversely, returning 1 or a higher value means that
> the strings are less close to each other
> * There are other classes that can be used for text similarity, such as
> the FuzzyScore ([lang]), and the CosineSimilarity (used by the Cosine edit
> distance). Others might be added later, such as the Jaccard Index. The
> behaviour of each of these classes varies
>
> I think it is simpler, and users will quickly understand the API. Once one
> understands what an edit distance is, s/he can guess the behaviour of any
> of its implementations.
>
> What do you think? If you agree I'd like to merge the branch and fix the
> issue.
>
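
To illustrate the contract described in the bullet points above (0 for identical strings, larger values for strings that are further apart), here is a minimal sketch. It is not the code from the SANDBOX-488 branch: the single-method shape of the interface, the method name `apply`, the `int` return type and the two nested classes are assumptions made for illustration only.

```java
public class EditDistanceSketch {

    /** Hypothetical stand-in for the renamed interface; the method name is an assumption. */
    public interface EditDistance {
        /** Returns 0 for identical inputs; larger values mean the inputs are further apart. */
        int apply(CharSequence left, CharSequence right);
    }

    /** Hamming distance: number of positions at which two equal-length inputs differ. */
    public static final class Hamming implements EditDistance {
        @Override
        public int apply(CharSequence left, CharSequence right) {
            if (left.length() != right.length()) {
                throw new IllegalArgumentException("Inputs must have the same length");
            }
            int distance = 0;
            for (int i = 0; i < left.length(); i++) {
                if (left.charAt(i) != right.charAt(i)) {
                    distance++;
                }
            }
            return distance;
        }
    }

    /** Levenshtein distance (insertions, deletions, substitutions) using two rolling rows. */
    public static final class Levenshtein implements EditDistance {
        @Override
        public int apply(CharSequence left, CharSequence right) {
            int[] prev = new int[right.length() + 1];
            int[] curr = new int[right.length() + 1];
            for (int j = 0; j <= right.length(); j++) {
                prev[j] = j; // distance from the empty prefix of left
            }
            for (int i = 1; i <= left.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= right.length(); j++) {
                    int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                }
                int[] tmp = prev;
                prev = curr;
                curr = tmp;
            }
            return prev[right.length()];
        }
    }

    public static void main(String[] args) {
        // The caller depends only on the interface, so implementations can be swapped.
        EditDistance[] algorithms = { new Hamming(), new Levenshtein() };
        for (EditDistance algorithm : algorithms) {
            System.out.println(algorithm.getClass().getSimpleName() + ": "
                    + algorithm.apply("karolin", "kathrin"));
        }
    }
}
```

Because every implementation in this sketch follows the same "0 means identical" rule, calling code does not need to know which algorithm it was given.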

Very nice! Maybe we can even come up with a generic class that calculates a
distance based on a similarity score.
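
A minimal sketch of what such a generic class might look like, assuming the similarity score is normalised to the range [0, 1] with 1 meaning "identical". Under that assumption a distance can be derived as 1 - similarity. All names here (`SimilarityScore`, `SimilarityBackedDistance`, `CharJaccard`) are hypothetical and not part of [text]; the approach would not work as-is for unbounded scores.

```java
import java.util.HashSet;
import java.util.Set;

public class SimilarityToDistanceSketch {

    /** Hypothetical similarity contract: 1.0 = identical, 0.0 = nothing in common. */
    public interface SimilarityScore {
        double score(CharSequence left, CharSequence right);
    }

    /** Hypothetical generic adapter that derives a distance as 1 - similarity. */
    public static final class SimilarityBackedDistance {
        private final SimilarityScore similarity;

        public SimilarityBackedDistance(SimilarityScore similarity) {
            this.similarity = similarity;
        }

        /** Returns 0.0 for a similarity of 1.0; grows as the similarity shrinks. */
        public double apply(CharSequence left, CharSequence right) {
            return 1.0 - similarity.score(left, right);
        }
    }

    /** Example similarity: Jaccard index over the sets of characters in each input. */
    public static final class CharJaccard implements SimilarityScore {
        @Override
        public double score(CharSequence left, CharSequence right) {
            Set<Character> a = toSet(left);
            Set<Character> b = toSet(right);
            if (a.isEmpty() && b.isEmpty()) {
                return 1.0; // two empty inputs are treated as identical
            }
            Set<Character> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<Character> union = new HashSet<>(a);
            union.addAll(b);
            return (double) intersection.size() / union.size();
        }

        private static Set<Character> toSet(CharSequence text) {
            Set<Character> chars = new HashSet<>();
            for (int i = 0; i < text.length(); i++) {
                chars.add(text.charAt(i));
            }
            return chars;
        }
    }

    public static void main(String[] args) {
        SimilarityBackedDistance distance = new SimilarityBackedDistance(new CharJaccard());
        System.out.println(distance.apply("abc", "abc"));
        System.out.println(distance.apply("abc", "xyz"));
    }
}
```

The adapter only relies on the score being bounded, so any similarity that can be normalised to [0, 1] could be plugged in the same way.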

Benedikt


>
> TL;DR: the similarity package contains code to work on text similarity,
> such as edit distances, but also scores / indexes and other algorithms. The
> StringMetric interface has been renamed to EditDistance, and only edit
> distances implement it
>
> TIA
> Bruno
>
> [1] https://issues.apache.org/jira/browse/SANDBOX-488
> [2]
> https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=tree;f=src/main/java/org/apache/commons/text/similarity;h=a2de9f0196b543f50c6d2c28376feb311f46eeda;hb=refs/heads/SANDBOX-488
>
>  ------------------------------
>  *From:* Benedikt Ritter <britter@apache.org>
> *To:* Commons Developers List <dev@commons.apache.org>; Bruno P.
> Kinoshita <brunodepaulak@yahoo.com.br>
> *Sent:* Friday, December 19, 2014 2:35 AM
>
> *Subject:* Re: [TEXT] Distance vs. Metric vs. Similarity
>
>
>
> 2014-12-14 23:10 GMT+01:00 Bruno P. Kinoshita <brunodepaulak@yahoo.com.br>:
>
> > Sounds good, although I'm not sure I understand where you are going with
> > the marker interface. What is its purpose?
>
> Let's then keep the StringMetric interface and update its Javadoc.
> Thinking again, that other marker interface seems to be unnecessary.
>
> > Okay, but we need to make sure all algorithms really return a distance
> > then. As I said, FuzzyDistance currently really returns a similarity score.
> > An algorithm returning a distance should return a higher number for higher
> > distances.
>
> I had a look at the code, and I think I understand what you are saying now.
> In FuzzyDistance, the higher the score, the closer the strings are. That is
> different from what the other algorithms return.
>
> I believe I found why I named that package similarity. Probably it was
> because I saw that in the stringmetric library [1]. There, Levenshtein,
> Jaccard and other algorithms are suffixed with "Metric".
>
> How about we keep the package as similarity and simply rename the classes
> to [Algo]Metric too? This way we will be able to accommodate other metrics
> such as the Sorensen-Dice coefficient, where the higher the coefficient, the
> more similar two strings are.
>
> WDYT?
>
>
>
> Hey Bruno,
>
> yes we can do it that way. What I want to avoid is that users have to
> check the JavaDoc every time they use an algorithm. To me it would make
> sense to have a number of distance algorithms and they all return a
> distance. Or we have Similarity algorithms and they all return a
> similarity. That way users can swap out the underlying algorithms without
> changing their code.
>
> Benedikt
>
>
> Cheers
> Bruno
> [1] https://github.com/rockymadden/stringmetric
>
>
>
>      From: Benedikt Ritter <britter@apache.org>
>  To: Commons Developers List <dev@commons.apache.org>; Bruno P. Kinoshita
> <brunodepaulak@yahoo.com.br>
>  Sent: Sunday, December 14, 2014 6:45 PM
>  Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
>
> Hi Bruno,
>
>
>
> 2014-12-14 21:37 GMT+01:00 Bruno P. Kinoshita <brunodepaulak@yahoo.com.br>:
> >
> > Hello Benedikt!
> >
> > > Metric feels like it's something more general, but I'm not sure.
> >
> > You're right. Metric was supposed to be a general interface,
> > representing the String Metric from the Wikipedia article.
> >
> > > and the interface from StringMetric to StringDistance.
> >
> > I'm reading the Myers paper, and already have a local branch with the
> > Myers algorithm from [collections] ported to [text].
> >
> > Perhaps we could move the StringMetric interface to the o.a.c.text package,
> > and create a StringDistance or EditDistance interface in
> > o.a.c.text.distance.
> >
> > This way we can have String Metrics as in Wikipedia, as being a way of
> > giving a value for comparing two strings. We would have the edit distances
> > in the distance package, and the diff algorithms in another diff package.
> > All of them being String Metrics.
> >
> > What do you think?
> >
>
> Sounds good, although I'm not sure I understand where you are going with
> the marker interface. What is its purpose?
>
>
> > > I think we should consider renaming everything to distance, since the
> > > implemented algorithms all end on *Distance. So we would change the
> > > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > > interface from StringMetric to StringDistance.
> > >
> > > Looking at the code again, it seems like the algorithms all really
> > > return a similarity score and not a distance. For example, the
> > > FuzzyDistance JavaDoc states: "A higher score indicates a higher
> > > similarity". If this is the case, maybe it makes more sense to rename
> > > everything to Similarity?
> >
> > I'm in favor of dropping score and similarity, and adopting distance in
> > the package, classes and javadocs, as it is used in other tools (e.g.
> > Solr, Talend, Informatica IIR, etc).
> >
> >
>
> Okay, but we need to make sure all algorithms really return a distance
> then. As I said, FuzzyDistance currently really returns a similarity score.
> An algorithm returning a distance should return a higher number for higher
> distances.
>
> Benedikt
>
>
> > All the best,
> > Bruno
> >
> >
> >      From: Benedikt Ritter <britter@apache.org>
> >  To: Commons Developers List <dev@commons.apache.org>
> >  Sent: Sunday, December 14, 2014 6:20 PM
> >  Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
> >
> > 2014-12-14 21:08 GMT+01:00 Benedikt Ritter <britter@apache.org>:
> > >
> > > Hi,
> > >
> > > currently the wording in commons text is a bit confusing. We have the
> > > three terms:
> > >
> > > - distance
> > > - similarity
> > > - metric
> > >
> > > Distance and similarity seem to be just opposites of the same thing. A
> > > great distance indicates a small similarity between two character
> > > sequences. Metric feels like it's something more general, but I'm not
> > > sure.
> > >
> > > I think we should consider renaming everything to distance, since the
> > > implemented algorithms all end on *Distance. So we would change the
> > > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > > interface from StringMetric to StringDistance.
> > >
> >
> > Looking at the code again, it seems like the algorithms all really
> > return a similarity score and not a distance. For example, the
> > FuzzyDistance JavaDoc states: "A higher score indicates a higher
> > similarity". If this is the case, maybe it makes more sense to rename
> > everything to Similarity?
> >
> >
> > >
> > > WDYT?
> > >
> > > Benedikt
> > >
> > > --
> > > http://people.apache.org/~britter/
> > > http://www.systemoutprintln.de/
> > > http://twitter.com/BenediktRitter
> > > http://github.com/britter




   
