commons-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Work logged] (TEXT-126) Dice's Coefficient Algorithm in String similarity
Date Sat, 09 Mar 2019 04:26:00 GMT

     [ https://issues.apache.org/jira/browse/TEXT-126?focusedWorklogId=210462&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-210462
]

ASF GitHub Bot logged work on TEXT-126:
---------------------------------------

                Author: ASF GitHub Bot
            Created on: 09/Mar/19 04:25
            Start Date: 09/Mar/19 04:25
    Worklog Time Spent: 10m 
      Work Description: kinow commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity
algoritham
URL: https://github.com/apache/commons-text/pull/103#issuecomment-471144174
 
 
   @ameyjadiye see last comment from @aherbert about empty strings and `0` vs. `1`.
   
   @aherbert while we are discussing #109, do you think that is a blocker for this pull request?
So far I think at least the API proposed here would be kept, right?
   
   If so, this could be merged once the last comment is resolved, and then we can discuss
how to organize the classes and where the sorensen-dice coefficient is calculated.
   
   I think the only thing missing is deciding on the name of the classes? Whether it should
use `Bigram` in the name or be just `SorensenDiceSimilarity`.
   
   I like the idea of having a descriptive name such as `BigramSorensenDiceSimilarity` (or
`Bigram` in another place/order). However, I think we should also consider what users would
expect, i.e. in other libraries, is the Sorensen-Dice similarity always computed over bigrams?
If other implementations in Python/JS/Java use bigrams, then we could leave it as `SorensenDiceSimilarity`
and either add another method/constructor/etc. to customize the similarity, or have another
class...
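   
For reference while weighing the name, here is a minimal sketch of what a bigram-based
Sorensen-Dice similarity might look like. It is not the code from this pull request: the class
name `BigramDiceSketch`, the use of sets (rather than multisets) of character bigrams, and the
handling of inputs shorter than two characters are all assumptions made for illustration. In
particular, the `0` vs. `1` result for empty strings is still an open point in this discussion.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the class proposed in the pull request.
final class BigramDiceSketch {

    private BigramDiceSketch() {
    }

    // Collects the distinct character bigrams of s (assumption: a set, not a multiset).
    static Set<String> bigrams(CharSequence s) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            result.add(s.subSequence(i, i + 2).toString());
        }
        return result;
    }

    // Sorensen-Dice coefficient: 2 * |A intersect B| / (|A| + |B|) over bigram sets A and B.
    static double similarity(CharSequence left, CharSequence right) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("Inputs must not be null");
        }
        Set<String> a = bigrams(left);
        Set<String> b = bigrams(right);
        if (a.isEmpty() && b.isEmpty()) {
            // Assumption: with no bigrams on either side, equal inputs score 1 and
            // different inputs score 0. The thread has not settled this case.
            return left.toString().contentEquals(right) ? 1.0 : 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b); // keep only the bigrams present in both inputs
        return 2.0 * common.size() / (a.size() + b.size());
    }
}
```

Under these assumptions, `BigramDiceSketch.similarity("night", "nacht")` finds one shared bigram
(`ht`) out of four on each side, giving 2 * 1 / (4 + 4) = 0.25.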
   
   What do you think? (@ameyjadiye if you have any suggestion, feel free to chime in too :+1:
[or any other person reading this :slightly_smiling_face: ])
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 210462)
    Time Spent: 9h 50m  (was: 9h 40m)

> Dice's Coefficient Algorithm in String similarity
> -------------------------------------------------
>
>                 Key: TEXT-126
>                 URL: https://issues.apache.org/jira/browse/TEXT-126
>             Project: Commons Text
>          Issue Type: Improvement
>            Reporter: Vicky Chawda
>            Priority: Major
>          Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> I'd like to propose an extension to the algorithms for string similarity in *commons-text/src/main/java/org/apache/commons/text/similarity/*
>  Dice's Coefficient Algorithm can be helpful for many who are looking for ranking similarities
in strings.
> *Inspired from* - [http://www.catalysoft.com/articles/StrikeAMatch.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
