commons-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Work logged] (TEXT-157) Remove rounding from JaccardSimilarity and Distance to improve ranking
Date Fri, 08 Mar 2019 12:20:00 GMT

     [ https://issues.apache.org/jira/browse/TEXT-157?focusedWorklogId=210108&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-210108 ]

ASF GitHub Bot logged work on TEXT-157:
---------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Mar/19 12:19
            Start Date: 08/Mar/19 12:19
    Worklog Time Spent: 10m 
      Work Description: kinow commented on pull request #111: TEXT-157: Remove rounding from
JaccardSimilarity and Distance
URL: https://github.com/apache/commons-text/pull/111#discussion_r263759006
 
 

 ##########
 File path: src/test/java/org/apache/commons/text/similarity/JaccardDistanceTest.java
 ##########
 @@ -36,21 +36,23 @@ public static void setUp() {
 
     @Test
     public void testGettingJaccardDistance() {
-        assertEquals(1.00d, classBeingTested.apply("", ""), 0.00000000000000000001d);
-        assertEquals(1.00d, classBeingTested.apply("left", ""), 0.00000000000000000001d);
-        assertEquals(1.00d, classBeingTested.apply("", "right"), 0.00000000000000000001d);
-        assertEquals(0.25d, classBeingTested.apply("frog", "fog"), 0.00000000000000000001d);
-        assertEquals(1.00d, classBeingTested.apply("fly", "ant"), 0.00000000000000000001d);
-        assertEquals(0.78d, classBeingTested.apply("elephant", "hippo"), 0.00000000000000000001d);
-        assertEquals(0.36d, classBeingTested.apply("ABC Corporation", "ABC Corp"), 0.00000000000000000001d);
-        assertEquals(0.24d, classBeingTested.apply("D N H Enterprises Inc", "D & H Enterprises, Inc."),
-                0.00000000000000000001d);
-        assertEquals(0.11d, classBeingTested.apply("My Gym Children's Fitness Center", "My Gym. Childrens Fitness"),
-                0.00000000000000000001d);
-        assertEquals(0.10d, classBeingTested.apply("PENNSYLVANIA", "PENNCISYLVNIA"), 0.00000000000000000001d);
-        assertEquals(0.87d, classBeingTested.apply("left", "right"), 0.00000000000000000001d);
-        assertEquals(0.87d, classBeingTested.apply("leettteft", "ritttght"), 0.00000000000000000001d);
-        assertEquals(0.0d, classBeingTested.apply("the same string", "the same string"), 0.00000000000000000001d);
+        // Results generated using the python distance library using:
+        // distance.jaccard(seq1, seq2)
+        assertEquals(1.0, classBeingTested.apply("", ""));
+        assertEquals(1.0, classBeingTested.apply("left", ""));
+        assertEquals(1.0, classBeingTested.apply("", "right"));
+        assertEquals(0.25, classBeingTested.apply("frog", "fog"));
+        assertEquals(1.0, classBeingTested.apply("fly", "ant"));
+        assertEquals(0.7777777777777778, classBeingTested.apply("elephant", "hippo"));
 
 Review comment:
   >How about a comment in the test explaining where each value comes from, or even the
actual computation
   
   I will take your word here :-) Let's leave it like this for now then. If it eventually
fails in a JVM, we can add that epsilon where appropriate, or the comment, or the
explicit calculation (I liked this last one; it never occurred to me to test that way!).
   
   :+1: from me. And Travis is happy too. Up to you to merge it now or wait for others to
review :)
 
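The "explicit calculation" option mentioned in the review comment could look roughly like the sketch below. It is not the project's test: `jaccardDistance` is a hypothetical stand-in written here so the example runs without commons-text or JUnit. It assumes the distance is taken over the sets of unique characters and that an empty input is maximally distant, which agrees with the expected values in the diff above. Each expected value is written as the fraction it comes from instead of a decimal literal.

```java
import java.util.HashSet;
import java.util.Set;

public class ExplicitExpectedValues {

    /** Hypothetical stand-in for JaccardDistance.apply, over sets of unique characters. */
    static double jaccardDistance(CharSequence left, CharSequence right) {
        if (left.length() == 0 || right.length() == 0) {
            return 1.0; // assumption: an empty input shares nothing with the other side
        }
        Set<Character> leftSet = toSet(left);
        Set<Character> rightSet = toSet(right);
        Set<Character> common = new HashSet<>(leftSet);
        common.retainAll(rightSet);
        int union = leftSet.size() + rightSet.size() - common.size();
        // distance = (|union| - |intersection|) / |union|
        return (double) (union - common.size()) / union;
    }

    /** Collects the unique characters of a sequence. */
    static Set<Character> toSet(CharSequence s) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < s.length(); i++) {
            set.add(s.charAt(i));
        }
        return set;
    }

    /** Exact comparison, mirroring assertEquals(expected, actual) without a delta. */
    static void check(double expected, double actual) {
        if (expected != actual) {
            throw new AssertionError("expected " + expected + " but was " + actual);
        }
    }

    public static void main(String[] args) {
        // frog = {f,r,o,g}, fog = {f,o,g}: union 4, intersection 3
        check(1.0 / 4, jaccardDistance("frog", "fog"));
        // fly and ant share no characters: union 6, intersection 0
        check(6.0 / 6, jaccardDistance("fly", "ant"));
        // elephant = {e,l,p,h,a,n,t}, hippo = {h,i,p,o}: union 9, intersection 2
        check(7.0 / 9, jaccardDistance("elephant", "hippo"));
        System.out.println("all explicit expectations hold");
    }
}
```

The exact comparison works here because the helper performs the same division as the expected expression. An implementation that computes `1.0 - similarity` instead may differ in the last bit, which is where the epsilon discussed above would come in.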
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 210108)
    Time Spent: 50m  (was: 40m)

> Remove rounding from JaccardSimilarity and Distance to improve ranking
> ----------------------------------------------------------------------
>
>                 Key: TEXT-157
>                 URL: https://issues.apache.org/jira/browse/TEXT-157
>             Project: Commons Text
>          Issue Type: Improvement
>    Affects Versions: 1.6
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Trivial
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> {{JaccardSimilarity}} rounds its result to 2 decimal places. This prevents ranking of
dissimilar sequences of even moderately short length.
> For example, using sequences that have 1 or 2 characters in common while all remaining characters
differ (columns: sequence length, similarity to the first alternative, similarity to the second):
> {noformat}
>  2 0.500000 1.000000 : aa vs (ab or aa)
>  3 0.250000 0.330000 : aaD vs (abd or aaÀ)
>  4 0.170000 0.200000 : aaDE vs (abde or aaÀÁ)
>  5 0.130000 0.140000 : aaDEF vs (abdef or aaÀÁÂ)
>  6 0.100000 0.110000 : aaDEFG vs (abdefg or aaÀÁÂÃ)
>  7 0.080000 0.090000 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
>  8 0.070000 0.080000 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
>  9 0.060000 0.070000 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
> 10 0.060000 0.060000 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
> {noformat}
> Without rounding, scores that previously collapsed to the same value remain distinct,
which improves ranking:
> {noformat}
>  2 0.500000 1.000000 : aa vs (ab or aa)
>  3 0.250000 0.333333 : aaD vs (abd or aaÀ)
>  4 0.166667 0.200000 : aaDE vs (abde or aaÀÁ)
>  5 0.125000 0.142857 : aaDEF vs (abdef or aaÀÁÂ)
>  6 0.100000 0.111111 : aaDEFG vs (abdefg or aaÀÁÂÃ)
>  7 0.083333 0.090909 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
>  8 0.071429 0.076923 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
>  9 0.062500 0.066667 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
> 10 0.055556 0.058824 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
> {noformat}
>  Generated using:
> {code:java}
> @Test
> public void roundingDemo() {
>     // First character of each dissimilar sequence.
>     // Chosen for a nice output where we already know the loop
>     // will exit before sequence overlap.
>     char ch1 = 'D';
>     char ch2 = 'd';
>     char ch3 = 0x00c0;
>     // 1 or 2 characters in common
>     StringBuilder sb1 = new StringBuilder("aa");
>     StringBuilder sb2 = new StringBuilder("ab"); // 1 in common
>     StringBuilder sb3 = new StringBuilder("aa"); // 2 in common
>     JaccardSimilarity similarity = new JaccardSimilarity();
>     // Extend the sequences until a single/double character 
>     // similarity cannot be detected
>     double j1, j2;
>     do  {
>         j1 = similarity.apply(sb1, sb2);
>         j2 = similarity.apply(sb1, sb3);
>         System.out.printf("%2d %f %f : %s vs (%s or %s)%n", 
>                           sb1.length(), j1, j2, sb1, sb2, sb3);
>         // Extend the sequence using unique characters for each
>         sb1.append(ch1++);
>         sb2.append(ch2++);
>         sb3.append(ch3++);
>         // Note: Check length since the sequences will overlap
>         // in case the rounding is not present
>     } while (j1 != j2 && sb1.length() < 26); 
> }
> {code}
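
The length-10 row of the tables above can also be reproduced in isolation. The sketch below is an illustration only, not the commons-text implementation: `similarity` is a hypothetical helper over sets of unique characters, and `roundTo2` is an assumed form of the 2-decimal rounding the issue describes. It shows 1/18 and 1/17 both rounding to 0.06, so the two candidates tie after rounding even though the unrounded scores differ.

```java
import java.util.HashSet;
import java.util.Set;

public class RoundingCollision {

    /** Unrounded Jaccard similarity over unique characters: |intersection| / |union|. */
    static double similarity(CharSequence left, CharSequence right) {
        Set<Character> leftSet = new HashSet<>();
        for (int i = 0; i < left.length(); i++) {
            leftSet.add(left.charAt(i));
        }
        Set<Character> rightSet = new HashSet<>();
        for (int i = 0; i < right.length(); i++) {
            rightSet.add(right.charAt(i));
        }
        Set<Character> common = new HashSet<>(leftSet);
        common.retainAll(rightSet);
        int union = leftSet.size() + rightSet.size() - common.size();
        if (union == 0) {
            return 0.0; // assumption: two empty inputs have no similarity
        }
        return (double) common.size() / union;
    }

    /** Assumed form of the rounding to 2 decimal places described in the issue. */
    static double roundTo2(double value) {
        return Math.round(value * 100d) / 100d;
    }

    public static void main(String[] args) {
        String query = "aaDEFGHIJK";
        String first = "abdefghijk";
        // "aa" followed by the eight characters U+00C0..U+00C7, as in the table
        String second = "aa\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5\u00c6\u00c7";

        double s1 = similarity(query, first);  // 1 common of 18 unique characters
        double s2 = similarity(query, second); // 1 common of 17 unique characters

        System.out.printf("unrounded: %f vs %f%n", s1, s2);
        System.out.printf("rounded:   %.2f vs %.2f%n", roundTo2(s1), roundTo2(s2));
        System.out.println("tie after rounding: " + (roundTo2(s1) == roundTo2(s2)));
    }
}
```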



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
