Date: Thu, 16 Apr 2015 11:38:27 +0000 (UTC)
From: "Bruno P. Kinoshita"
Reply-To: "Bruno P. Kinoshita"
To: Commons Developers List
Message-ID: <350394023.1689972.1429184307586.JavaMail.yahoo@mail.yahoo.com>
Subject: Re: [TEXT] Distance vs. Metric vs. Similarity

Hi Benedikt

> Very nice! Maybe we can even come up with a generic class that calculates a
> distance based on a similarity score.

Hmmm, that's a good idea. We probably want to keep that idea in an issue for
later :-) [1]

I'll use my next development cycle on [text] to review the code and reports,
and to write the user guide with what we have already in the project.

Do you think we would need anything else before trying a 1.0 release? There
are two TODO marks in the test, but I plan to get rid of them in the next
days too.
But they don't seem like a blocker right now anyway.

Thanks
Bruno

[1] https://issues.apache.org/jira/browse/SANDBOX-495

From: Benedikt Ritter
To: Commons Developers List
Sent: Wednesday, April 15, 2015 11:03 PM
Subject: Re: [TEXT] Distance vs. Metric vs. Similarity

Hi Bruno

2015-04-15 12:14 GMT+02:00 Bruno P. Kinoshita:

> Hi Benedikt,
>
> After playing more with [text] and some edit distances, I think we can
> resume this conversation and hopefully fix SANDBOX-488 [1].
>
> I've created a branch SANDBOX-488 in git [2] with the following
> modifications:
>
> * The StringMetric interface has been renamed to EditDistance
> * We have the following edit distances available: Levenshtein,
>   JaroWinkler, Hamming ([lang]) and Cosine. Others might be added in the
>   future, such as Jaccard and QGram
> * When an edit distance returns 0, it means both strings are identical or
>   at least very similar. Conversely, returning 1 or a higher value means
>   that the strings are less close to each other
> * There are other classes that can be used for text similarity, such as
>   the FuzzyScore ([lang]), and the CosineSimilarity (used by the Cosine
>   edit distance). Others might be added later, such as the Jaccard Index.
>   The behaviour of each of these classes varies
>
> I think it is simpler, and users will quickly understand the API. Once one
> understands what an edit distance is, s/he can guess the behaviour of any
> of its implementations.
>
> What do you think? If you agree I'd like to merge the branch and fix the
> issue.

Very nice! Maybe we can even come up with a generic class that calculates a
distance based on a similarity score.

Benedikt

> TL;DR: the similarity package contains code to work on text similarity,
> such as edit distances, but also scores / indexes and other algorithms.
> The StringMetric interface has been renamed to EditDistance, and only edit
> distances implement it
>
> TIA
> Bruno
>
> [1] https://issues.apache.org/jira/browse/SANDBOX-488
> [2] https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=tree;f=src/main/java/org/apache/commons/text/similarity;h=a2de9f0196b543f50c6d2c28376feb311f46eeda;hb=refs/heads/SANDBOX-488
>
> ------------------------------
> *From:* Benedikt Ritter
> *To:* Commons Developers List; Bruno P. Kinoshita
> *Sent:* Friday, December 19, 2014 2:35 AM
> *Subject:* Re: [TEXT] Distance vs. Metric vs. Similarity
>
> 2014-12-14 23:10 GMT+01:00 Bruno P. Kinoshita:
>
> > > Sounds good, although I'm not sure I understand where you are going
> > > with the marker interface. What is its purpose?
> >
> > Let's then keep the StringMetric interface and update its Javadoc.
> > Thinking again, that other marker interface seems to be unnecessary.
> >
> > > Okay, but we need to make sure all algorithms really return a distance
> > > then. As I said, FuzzyDistance currently really returns a similarity
> > > score. An algorithm returning a distance should return a higher number
> > > for higher distances.
> >
> > I had a look at the code, and I think I understand what you are saying
> > now. In FuzzyDistance, the higher the score, the closer the strings are.
> > That is different from what the other algorithms return.
> >
> > I believe I found why I named that package similarity. Probably it was
> > because I saw that in the stringmetric library [1]. There, Levenshtein,
> > Jaccard and other algorithms are suffixed with "Metric".
> >
> > How about we keep the package as similarity and simply rename the classes
> > to [Algo]Metric too? This way we will be able to accommodate other
> > metrics such as the Sorensen-Dice coefficient, where the higher the
> > coefficient, the more similar two strings are.
> >
> > WDYT?
>
> Hey Bruno,
>
> yes we can do it that way.
> What I want to avoid is that users have to check the Javadoc every time
> they use an algorithm. To me it would make sense to have a number of
> distance algorithms that all return a distance, or Similarity algorithms
> that all return a similarity. That way users can swap out the underlying
> algorithm without changing their code.
>
> Benedikt
>
> > Cheers
> > Bruno
> >
> > [1] https://github.com/rockymadden/stringmetric
> >
> > From: Benedikt Ritter
> > To: Commons Developers List; Bruno P. Kinoshita
> > Sent: Sunday, December 14, 2014 6:45 PM
> > Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
> >
> > Hi Bruno,
> >
> > 2014-12-14 21:37 GMT+01:00 Bruno P. Kinoshita:
> >
> > > Hello Benedikt!
> > >
> > > > Metric feels like it's something more general, but I'm not sure.
> > >
> > > You're right. Metric was supposed to be a general interface,
> > > representing the String Metric from the Wikipedia article.
> > >
> > > > and the interface from StringMetric to StringDistance.
> > >
> > > I'm reading the Myers paper, and already have a local branch with the
> > > Myers algorithm from [collections] ported to [text].
> > >
> > > Perhaps we could move the StringMetric interface to the o.a.c.text
> > > package, and create a StringDistance or EditDistance interface in
> > > o.a.c.text.distance.
> > >
> > > This way we can have String Metrics as in Wikipedia, as a way of giving
> > > a value for comparing two strings. We would have the edit distances in
> > > the distance package, and the diff algorithms in another diff package,
> > > all of them being String Metrics.
> > >
> > > What do you think?
> >
> > Sounds good, although I'm not sure I understand where you are going with
> > the marker interface. What is its purpose?
> >
> > > > I think we should consider renaming everything to distance, since the
> > > > implemented algorithms all end on *Distance.
> > > > So we would change the package name from o.a.c.text.similarity to
> > > > o.a.c.text.distance and the interface from StringMetric to
> > > > StringDistance.
> > > >
> > > > Looking at the code again, it seems like the algorithms all really
> > > > return a similarity score and not a distance. For example, the
> > > > FuzzyDistance Javadoc states: "A higher score indicates a higher
> > > > similarity". If this is the case, maybe it makes more sense to rename
> > > > everything to Similarity?
> > >
> > > I'm in favor of dropping score and similarity, and adopting distance in
> > > the package, classes and javadocs, as it is used in other tools (e.g.
> > > Solr, Talend, Informatica IIR, etc).
> >
> > Okay, but we need to make sure all algorithms really return a distance
> > then. As I said, FuzzyDistance currently really returns a similarity
> > score. An algorithm returning a distance should return a higher number
> > for higher distances.
> >
> > Benedikt
> >
> > > All the best,
> > > Bruno
> > >
> > > From: Benedikt Ritter
> > > To: Commons Developers List
> > > Sent: Sunday, December 14, 2014 6:20 PM
> > > Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
> > >
> > > 2014-12-14 21:08 GMT+01:00 Benedikt Ritter:
> > >
> > > > Hi,
> > > >
> > > > currently the wording in commons text is a bit confusing. We have the
> > > > three terms:
> > > >
> > > > - distance
> > > > - similarity
> > > > - metric
> > > >
> > > > Distance and similarity seem to be just opposites of the same thing. A
> > > > great distance indicates a small similarity between two character
> > > > sequences. Metric feels like it's something more general, but I'm not
> > > > sure.
> > > >
> > > > I think we should consider renaming everything to distance, since the
> > > > implemented algorithms all end on *Distance. So we would change the
> > > > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > > > interface from StringMetric to StringDistance.
> > >
> > > Looking at the code again, it seems like the algorithms all really
> > > return a similarity score and not a distance. For example, the
> > > FuzzyDistance Javadoc states: "A higher score indicates a higher
> > > similarity". If this is the case, maybe it makes more sense to rename
> > > everything to Similarity?
> > >
> > > > WDYT?
> > > >
> > > > Benedikt
> > > >
> > > > --
> > > > http://people.apache.org/~britter/
> > > > http://www.systemoutprintln.de/
> > > > http://twitter.com/BenediktRitter
> > > > http://github.com/britter
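
For illustration only, the "generic class that calculates a distance based on a similarity score" mentioned at the top of the thread could look roughly like the sketch below. The name EditDistance comes from the thread, but its method signature here is an assumption, and SimilarityScore, JaccardSimilarity and SimilarityBasedDistance are hypothetical names, not the [text] API. The sketch also assumes the similarity is normalized to [0, 1], so that 1 - similarity is 0 for identical strings; an unbounded score such as the one FuzzyDistance returns would need a different mapping.

```java
import java.util.HashSet;
import java.util.Set;

public class DistanceSketch {

    /** Hypothetical contract: 0 means identical, larger values mean further apart. */
    public interface EditDistance {
        double apply(CharSequence left, CharSequence right);
    }

    /** Hypothetical contract: score in [0, 1], higher means more similar. */
    public interface SimilarityScore {
        double apply(CharSequence left, CharSequence right);
    }

    /** Jaccard index over character sets, used here only as a stand-in similarity. */
    public static final class JaccardSimilarity implements SimilarityScore {
        @Override
        public double apply(CharSequence left, CharSequence right) {
            Set<Character> a = chars(left);
            Set<Character> b = chars(right);
            if (a.isEmpty() && b.isEmpty()) {
                return 1.0; // two empty sequences are treated as identical
            }
            Set<Character> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<Character> union = new HashSet<>(a);
            union.addAll(b);
            return (double) intersection.size() / union.size();
        }

        private static Set<Character> chars(CharSequence s) {
            Set<Character> set = new HashSet<>();
            for (int i = 0; i < s.length(); i++) {
                set.add(s.charAt(i));
            }
            return set;
        }
    }

    /** Generic adapter (assumption: the wrapped score is normalized to [0, 1]). */
    public static final class SimilarityBasedDistance implements EditDistance {
        private final SimilarityScore score;

        public SimilarityBasedDistance(SimilarityScore score) {
            this.score = score;
        }

        @Override
        public double apply(CharSequence left, CharSequence right) {
            // Invert the score so that "more similar" becomes "smaller distance".
            return 1.0 - score.apply(left, right);
        }
    }

    public static void main(String[] args) {
        EditDistance distance = new SimilarityBasedDistance(new JaccardSimilarity());
        System.out.println(distance.apply("night", "night")); // identical strings
        System.out.println(distance.apply("night", "nacht")); // partial overlap
        System.out.println(distance.apply("abc", "xyz"));     // no shared characters
    }
}
```

With this adapter, callers written against EditDistance can swap the underlying algorithm without changing their code: identical strings give 0.0 and strings with no characters in common give 1.0, matching the "0 means identical" convention described in the thread.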