commons-issues mailing list archives

From "Bruno P. Kinoshita (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TEXT-109) Implement or document how to use edit distances that consider the keyboard layout
Date Sun, 24 Dec 2017 23:47:00 GMT
Bruno P. Kinoshita created TEXT-109:
---------------------------------------

             Summary: Implement or document how to use edit distances that consider the keyboard layout
                 Key: TEXT-109
                 URL: https://issues.apache.org/jira/browse/TEXT-109
             Project: Commons Text
          Issue Type: New Feature
            Reporter: Bruno P. Kinoshita
            Priority: Minor


Most edit distances take into consideration the number of "changes" required in one string to
match another string, and they give you a value that represents the distance between the
words.

While this is helpful, datasets and corpora that have been created with keyboards
(e.g. SMS, e-mail, transcripts) commonly contain typing mistakes. In some cases a letter was
accidentally mistyped, but the character typed is normally close to the correct character on
the keyboard.

For example, take the word "one" and two misspellings, "onr" and "oni". The Levenshtein
distance for both would be 1. But if you are aware that the keyboard layout is English with
the QWERTY layout (notice the E and the R), then the distance between "one" and "onr" should
be smaller than the distance between "one" and "oni", because on the English keyboard the
letter 'E' is neighbouring 'R', whereas 'I' is not even covered by the left hand, but by the
right hand.
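
To make the E/R versus E/I comparison concrete, here is a minimal sketch of a key-to-key
distance on a QWERTY layout. The class name QwertyKeyDistance, the row offsets, and the use of
Euclidean distance are all assumptions made for illustration; none of this exists in Commons
Text.

```java
public class QwertyKeyDistance {

    // Letter rows of a US QWERTY keyboard, top to bottom.
    private static final String[] ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};

    // Assumed horizontal stagger of each row, in key widths (an approximation).
    private static final double[] ROW_OFFSETS = {0.0, 0.25, 0.75};

    /** Returns {x, y} for a letter key, or null if the character is not mapped. */
    public static double[] position(char c) {
        char lower = Character.toLowerCase(c);
        for (int row = 0; row < ROWS.length; row++) {
            int col = ROWS[row].indexOf(lower);
            if (col >= 0) {
                return new double[] {col + ROW_OFFSETS[row], row};
            }
        }
        return null;
    }

    /** Euclidean distance between two letter keys, measured in key widths. */
    public static double keyDistance(char a, char b) {
        double[] pa = position(a);
        double[] pb = position(b);
        if (pa == null || pb == null) {
            throw new IllegalArgumentException("Unmapped key: " + a + " or " + b);
        }
        return Math.hypot(pa[0] - pb[0], pa[1] - pb[1]);
    }

    public static void main(String[] args) {
        System.out.println(keyDistance('e', 'r')); // neighbouring keys
        System.out.println(keyDistance('e', 'i')); // keys under different hands
    }
}
```

With these assumed coordinates, 'E' and 'R' are one key width apart and 'E' and 'I' are five
key widths apart, which matches the intuition in the example.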

Here are some reference links for further research.

* https://findsomethingnewtoday.wordpress.com/2013/07/20/986/
* https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
* http://www.nada.kth.se/~ann/exjobb/axel_samuelsson.pdf
* https://github.com/wsong/Typo-Distance
* https://stackoverflow.com/questions/29233888/edit-distance-such-as-levenshtein-taking-into-account-proximity-on-keyboard

Ideally, such an edit distance would be extensible to support other keyboard layouts.

There is some indication that an existing edit distance like Levenshtein could be
extended to take the keyboard layout into consideration, so a new edit distance may
not be entirely necessary.
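
As a rough sketch of that idea, the standard Levenshtein recurrence can be kept as is, with
only the substitution cost changed so that swapping neighbouring keys is cheaper. The class
name KeyboardLevenshtein, the neighbour radius of 1.5 key widths, and the reduced cost of 0.5
are hypothetical choices for illustration, not an existing Commons Text API.

```java
public class KeyboardLevenshtein {

    // Letter rows of a US QWERTY keyboard; another layout would swap these out.
    private static final String[] ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};
    // Assumed horizontal stagger of each row, in key widths.
    private static final double[] ROW_OFFSETS = {0.0, 0.25, 0.75};
    // Assumed tuning values: keys within this radius count as neighbours.
    private static final double NEIGHBOUR_RADIUS = 1.5;
    private static final double NEIGHBOUR_COST = 0.5;

    private static double[] position(char c) {
        char lower = Character.toLowerCase(c);
        for (int row = 0; row < ROWS.length; row++) {
            int col = ROWS[row].indexOf(lower);
            if (col >= 0) {
                return new double[] {col + ROW_OFFSETS[row], row};
            }
        }
        return null; // not a mapped letter key
    }

    /** 0 for identical characters, a reduced cost for neighbouring keys, else 1. */
    public static double substitutionCost(char a, char b) {
        if (a == b) {
            return 0.0;
        }
        double[] pa = position(a);
        double[] pb = position(b);
        if (pa == null || pb == null) {
            return 1.0; // unmapped characters fall back to the plain cost
        }
        double d = Math.hypot(pa[0] - pb[0], pa[1] - pb[1]);
        return d <= NEIGHBOUR_RADIUS ? NEIGHBOUR_COST : 1.0;
    }

    /** Levenshtein distance with keyboard-aware substitution costs. */
    public static double distance(CharSequence left, CharSequence right) {
        int n = left.length();
        int m = right.length();
        double[] prev = new double[m + 1];
        double[] curr = new double[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = j; // cost of inserting j characters
        }
        for (int i = 1; i <= n; i++) {
            curr[0] = i; // cost of deleting i characters
            for (int j = 1; j <= m; j++) {
                double sub = prev[j - 1]
                        + substitutionCost(left.charAt(i - 1), right.charAt(j - 1));
                double del = prev[j] + 1.0;
                double ins = curr[j - 1] + 1.0;
                curr[j] = Math.min(sub, Math.min(del, ins));
            }
            double[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[m];
    }

    public static void main(String[] args) {
        System.out.println(distance("one", "onr")); // substitution of neighbouring keys
        System.out.println(distance("one", "oni")); // substitution of distant keys
    }
}
```

Under these assumptions "onr" ends up closer to "one" than "oni" does, while insertions and
deletions still cost 1. Supporting other layouts would presumably mean passing the key
coordinates in rather than hard-coding them.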

We could come to the decision that it is too hard to implement and would be better
done in a spell checker, or that it would require some statistics and would be out of the
scope of Text. Or we could simply add documentation on how to do it, without adding any code.
Or, perhaps, we add a new edit distance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
