commons-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Work logged] (TEXT-157) Remove rounding from JaccardSimilarity and Distance to improve ranking
Date Fri, 08 Mar 2019 11:52:00 GMT

     [ https://issues.apache.org/jira/browse/TEXT-157?focusedWorklogId=210098&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-210098 ]

ASF GitHub Bot logged work on TEXT-157:
---------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Mar/19 11:51
            Start Date: 08/Mar/19 11:51
    Worklog Time Spent: 10m 
      Work Description: aherbert commented on pull request #111: TEXT-157: Remove rounding
from JaccardSimilarity and Distance
URL: https://github.com/apache/commons-text/pull/111
 
 
   Rounding to 2 decimal places in JaccardSimilarity prevents correct ranking of dissimilar strings of moderate length.
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 210098)
            Time Spent: 10m
    Remaining Estimate: 0h

> Remove rounding from JaccardSimilarity and Distance to improve ranking
> ----------------------------------------------------------------------
>
>                 Key: TEXT-157
>                 URL: https://issues.apache.org/jira/browse/TEXT-157
>             Project: Commons Text
>          Issue Type: Improvement
>    Affects Versions: 1.6
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Trivial
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
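> The {{JaccardSimilarity}} rounds its result to 2 decimal places. This prevents ranking of dissimilar sequences, even fairly short ones.
> The example compares a sequence against two others that share 1 or 2 characters with it and differ in all remaining characters. Each row shows the sequence length, the score for 1 character in common, and the score for 2 characters in common: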
> {noformat}
>  2 0.500000 1.000000 : aa vs (ab or aa)
>  3 0.250000 0.330000 : aaD vs (abd or aaÀ)
>  4 0.170000 0.200000 : aaDE vs (abde or aaÀÁ)
>  5 0.130000 0.140000 : aaDEF vs (abdef or aaÀÁÂ)
>  6 0.100000 0.110000 : aaDEFG vs (abdefg or aaÀÁÂÃ)
>  7 0.080000 0.090000 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
>  8 0.070000 0.080000 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
>  9 0.060000 0.070000 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
> 10 0.060000 0.060000 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
> {noformat}
> Without rounding, the two scores stay distinct where rounding had previously produced the same score. This improves ranking:
> {noformat}
>  2 0.500000 1.000000 : aa vs (ab or aa)
>  3 0.250000 0.333333 : aaD vs (abd or aaÀ)
>  4 0.166667 0.200000 : aaDE vs (abde or aaÀÁ)
>  5 0.125000 0.142857 : aaDEF vs (abdef or aaÀÁÂ)
>  6 0.100000 0.111111 : aaDEFG vs (abdefg or aaÀÁÂÃ)
>  7 0.083333 0.090909 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
>  8 0.071429 0.076923 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
>  9 0.062500 0.066667 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
> 10 0.055556 0.058824 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
> {noformat}
>  Generated using:
> {code:java}
> @Test
> public void roundingDemo() {
>     // First character of each dissimilar sequence.
>     // Chosen for a nice output where we already know the loop
>     // will exit before sequence overlap.
>     char ch1 = 'D';
>     char ch2 = 'd';
>     char ch3 = 0x00c0;
>     // 1 or 2 characters in common
>     StringBuilder sb1 = new StringBuilder("aa");
>     StringBuilder sb2 = new StringBuilder("ab"); // 1 in common
>     StringBuilder sb3 = new StringBuilder("aa"); // 2 in common
>     JaccardSimilarity similarity = new JaccardSimilarity();
>     // Extend the sequences until a single/double character 
>     // similarity cannot be detected
>     double j1, j2;
>     do  {
>         j1 = similarity.apply(sb1, sb2);
>         j2 = similarity.apply(sb1, sb3);
>         System.out.printf("%2d %f %f : %s vs (%s or %s)%n", 
>                           sb1.length(), j1, j2, sb1, sb2, sb3);
>         // Extend the sequence using unique characters for each
>         sb1.append(ch1++);
>         sb2.append(ch2++);
>         sb3.append(ch3++);
>         // Note: Also bound the length. Without rounding the scores never tie,
>         // so the appended characters would eventually overlap.
>     } while (j1 != j2 && sb1.length() < 26);
> }
> {code}
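
The effect described in the issue can be reproduced without the library. The following is a minimal standalone sketch, not the Commons Text implementation: the class name JaccardRoundingSketch and the helpers jaccard and roundTo2 are hypothetical, and the Math.round(x * 100d) / 100d step is an assumption about how the 2 decimal place rounding is performed. It computes the Jaccard index over the sets of distinct characters and prints the raw and rounded scores for the length 10 row.

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardRoundingSketch {

    // Collect the distinct characters of a sequence.
    static Set<Character> charSet(CharSequence cs) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    // Jaccard index: |intersection| / |union| of the character sets.
    // Returns 0 when both sequences are empty (a choice made for this sketch).
    static double jaccard(CharSequence left, CharSequence right) {
        Set<Character> l = charSet(left);
        Set<Character> r = charSet(right);
        Set<Character> union = new HashSet<>(l);
        union.addAll(r);
        if (union.isEmpty()) {
            return 0d;
        }
        Set<Character> intersection = new HashSet<>(l);
        intersection.retainAll(r);
        return (double) intersection.size() / union.size();
    }

    // Assumed form of the 2 decimal place rounding under discussion.
    static double roundTo2(double value) {
        return Math.round(value * 100d) / 100d;
    }

    public static void main(String[] args) {
        // The length 10 sequences from the tables in the issue.
        String base = "aaDEFGHIJK";
        String oneInCommon = "abdefghijk";
        String twoInCommon = "aa\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5\u00c6\u00c7";

        double j1 = jaccard(base, oneInCommon);
        double j2 = jaccard(base, twoInCommon);

        System.out.printf("raw:     %f %f%n", j1, j2);
        System.out.printf("rounded: %f %f%n", roundTo2(j1), roundTo2(j2));
    }
}
```

For these inputs the raw scores are 1/18 and 1/17, which differ, while both rounded values are 0.06, matching the last row of the two tables in the issue.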



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
