commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Work logged] (TEXT-126) Dice's Coefficient Algorithm in String similarity
Date Sun, 24 Mar 2019 15:07:00 GMT

     [ https://issues.apache.org/jira/browse/TEXT-126?focusedWorklogId=217722&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-217722
]

ASF GitHub Bot logged work on TEXT-126:
---------------------------------------

                Author: ASF GitHub Bot
            Created on: 24/Mar/19 15:06
            Start Date: 24/Mar/19 15:06
    Worklog Time Spent: 10m 
      Work Description: aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity
algoritham
URL: https://github.com/apache/commons-text/pull/103#issuecomment-475967738
 
 
   Overall this class now works nicely. But I think that my initial reservations have not
been met.
   
   There is no reason that this should use bigrams, or allow duplicate bigrams.
   
   If this goes through then you end up with a library that can compute the Jaccard with unique
single characters (by using a `Set`) and the Sorensen with bigrams including duplicates (by
using a `List` converted to a `Bag`). Since the two scores are related by the equations:
   
   ```
   J = S / (2 - S)
   S = 2J / (1 + J)
   ```
   
   It really does not make sense to compute one with a different method from the other. 
   
   I would actually state we have:
   
   - J1U and S2D
   
   Where the number is the number of characters (n-gram), U is for unique (non-duplicates),
and D is for duplicates.
   
   We actually have possibilities for the following in the library:
   
   - J1U
   - J1D
   - J2U
   - J2D
   - S1U
   - S1D
   - S2U
   - S2D
   
   Or generically: [JS]N[UD]
   
   Putting in an S2D metric to complement a J1U metric makes a confusing and inconsistent
library.
   
   BTW. I believe bigrams is better and duplicates is better. So this is the best algorithm.
It just totally contradicts the current Jaccard implementation.
   
   A cleaner design is to have both metrics support single or bigrams (or even n-grams) and
also allow or prevent duplicates via constructor arguments.
   
   The current Jaccard can be modified to do so by allowing the default constructor to default
to single, non-duplicate mode and overloaded constructors added to it.
   
   This idea may be best achieved by pushing shared functionality into a `NGramIntersectionResult`
class that accepts the size of the n-gram and the duplicates option and computes the IntersectionResult.
This can internally use an `IntersectionSimilarity` class to do the computation.
   
   ```
   class NGramIntersectionResult {
   
       NGramIntersectionResult(int ngram, boolean duplicates) {
            // ...
       }
   
       IntersectionResult compute(CharSequence cs1, CharSequence cs2) {
            // Use IntersectionSimilarity to compute the IntersectionResult:
            // Pass to the IntersectionSimilarity the function to split the CharSequence into

            // the appropriate collection (Set or List) using n-grams
            //
            // All the handling of the generic type (Character, Integer, String) used for
the n-gram
            // should be done in here and hidden from the API.
       }
   }
   ```
   
   This can be made more generic if necessary. But this then leads to the library that @kinow
identified ([Simetrics](https://github.com/Simmetrics/simmetrics)) which can support tokenisation
of input, case handling, splitting to n-grams, handling duplicates, extended unicode character
handling, etc. 
   
   This level of generalisation is out of scope for the current similarity package. But n-grams
and duplicates should be in scope because this is the difference between this algorithm for
S2D and the current library J1U algorithm.
   
   @ameyjadiye Would you like to add support for n-grams and unique/duplicate matching to
this or attempt to add a package scope class that can be shared with the JaccardSimilarity?
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 217722)
    Time Spent: 14h 40m  (was: 14.5h)

> Dice's Coefficient Algorithm in String similarity
> -------------------------------------------------
>
>                 Key: TEXT-126
>                 URL: https://issues.apache.org/jira/browse/TEXT-126
>             Project: Commons Text
>          Issue Type: Improvement
>            Reporter: Vicky Chawda
>            Priority: Major
>          Time Spent: 14h 40m
>  Remaining Estimate: 0h
>
> I'd like to propose an extension to the algorithms for string similarity in *commons-text/src/main/java/org/apache/commons/text/similarity/*
>  Dice's Coefficient Algorithm can be helpful for many who are looking for ranking similarities
in strings.
> *Inspired from* - [http://www.catalysoft.com/articles/StrikeAMatch.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message