commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <...@apache.org>
Subject [GitHub] [commons-text] aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity algoritham
Date Sat, 09 Mar 2019 13:16:31 GMT
aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity algoritham
URL: https://github.com/apache/commons-text/pull/103#issuecomment-471176120
 
 
   I'll try and summarise where we are:
   
   1. It is not clear what the result should be of 0 / 0
   
   This is the `"" vs ""` case.
   
   Possibles:
   ```
   0 / 0 = n / n = 1
   0 / 0 = n / 0 = NaN
   0 / 0 = 0 / n = 0
   ```
   
   The library I found (python distance library) dealt with this by returning NaN. Others
have been found that return 1 (because they are the same, even if they are the same of nothing).
   
   The current `JaccardSimilarity` has decided to return 0.
   
   The new `SorensenDiceSimilarity` returns 1.
   
   So this discrepancy in our own library should be resolved.
   
   2. characters or bigrams
   
   How do you split up a `CharSequence` to build a `Set`? There are so many ways. Or do you
build a `List`?
   
   There are library examples for splitting a string into a set of characters (python distance
library) and for using bigrams. I note that @kinow example, the python `textdistance` library,
supports using a `Set` or `List`, and supports variable length n-grams:
   
   ```
   >>> textdistance.Sorensen(qval=1, as_set=1)("aaaba","aab")
   1.0
   >>> textdistance.Sorensen(qval=1, as_set=0)("aaaba","aab")
   0.75
   >>> textdistance.Sorensen(qval=2, as_set=1)("aaaba","aab")
   0.8
   >>> textdistance.Sorensen(qval=2, as_set=0)("aaaba","aab")
   0.6666666666666666
   ```
   
   This problem is solved by #109 which allows the user to define how to split the sequence
with a function.
   
   The current `JaccardSimilarity` has decided to split into single characters.
   
   The new `SorensenDiceSimilarity` splits into bigrams.
   
   Both use sets. So this discrepancy should be resolved.
   
   I think the solution is to support an n-gram size argument for the measure with default
of 1. Then support a `useSet` flag with default of `true`. This is all possible using #109
to do the intersect computation. The key is to not break compatibility for anyone already
using the `JaccardSimilarity`.
   
   3. How to score a `CharSequence` using bigrams with only 1 character
   
   ```
   "a" vs "a"  = 0   or   1?
   "a" vs "b"  = 0   or   NaN?
   ```
   
   This is similar to point 1 where you are actually scoring 0/0 again since you have no bigrams.
   
   4. Handling null
   
   Currently this just throws an exception. However `null` is similar to `null`.
   
   So if the library decides to not support `null` because it does not exists where does it
stand on a zero length `CharSequence`. This does also not exist, but does have established
meaning in the context of strings.
   
   Currently the stance in the Jaccard is it is a programming error to pass around `null`
and expect comparisons but it is perfectly reasonable to pass around empty stuff.
   
   
   **Discussion**
   
   To consolidate 0, 1 and 4 perhaps we can change the library to state that:
   
   - Similarity is undefined for `null` and throw an exception
   - Similarity between two empty sequences is either: (a) exception; (b) perfect; or (c)
nothing
   - Similarity between equal sequences that do not satisfy the criteria for the comparison
algorithm is either: (a) perfect; or (b) nothing
   
   Then support n-gram size and both the `Set` or `List` approach to defining the two collections
of n-grams to compare.
   
   This involves some decisions.
   
   My vote is:
   
   - Add n-gram and useSet parameters to the `JaccardSimilarity`
   - Add n-gram and useSet parameters to the `SorensenDiceSimilarity`
   - Make similarity 1 if the sequence does not satisfy the criteria for the algorithm but
is identical. This is less destructive than throwing an exception or returning NaN
   
   As long as the Javadoc explains the chosen functionality then the I would be happy with
any of the options. But my preference would be for equality being similar:
   
   ```
   "" vs "" == 1 (using a character comparison algorithm)
   "a" vs "a" == 1 (using a bigram algorithm)
   ```
   
   This is in-line with other similarity measures in the library which state equal sequences
are perfectly similar.
   
   Implementing a more generic approach can be done using methods in #109. So perhaps hold
this PR until that is resolved.
   
   Alex
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

Mime
View raw message