commons-issues mailing list archives

From "Rob Tompkins (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TEXT-109) Implement or document how to use edit distances that consider the keyboard layout
Date Tue, 13 Feb 2018 02:13:00 GMT

    [ https://issues.apache.org/jira/browse/TEXT-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361719#comment-16361719 ]

Rob Tompkins commented on TEXT-109:
-----------------------------------

It seems we want some flavor of a "weighted" edit distance here, where for each edit
you consider the keyboard distance between the keys involved. For this we would clearly
have to build keyboard layouts as constants. Let me fiddle around with the ideas some and I might
be able to come up with something interesting. My thought is to try coming up with something
and then cross-check it against the other ideas in the space listed above.
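
A minimal sketch of what such a weighted distance could look like, not a proposed
implementation. Everything in it is an assumption made for illustration: the class name
KeyboardWeightedLevenshtein, the simplified US QWERTY letter grid with approximate row
offsets, and the cost scheme (0.5 for a substitution between adjacent keys, 1.0 for any
other edit). It uses only the JDK; a real version would presumably implement the
EditDistance<Double> interface from org.apache.commons.text.similarity.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: Levenshtein distance with a keyboard-aware substitution cost. */
final class KeyboardWeightedLevenshtein {

    // Assumption: letters only, US QWERTY, approximate horizontal stagger per row.
    private static final String[] ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};
    private static final double[] ROW_OFFSET = {0.0, 0.25, 0.75};
    private static final Map<Character, double[]> POSITIONS = new HashMap<>();

    static {
        for (int r = 0; r < ROWS.length; r++) {
            for (int c = 0; c < ROWS[r].length(); c++) {
                POSITIONS.put(ROWS[r].charAt(c), new double[] {c + ROW_OFFSET[r], r});
            }
        }
    }

    private KeyboardWeightedLevenshtein() {
    }

    /** Returns 0 for equal characters, 0.5 for neighbouring keys, 1 otherwise. */
    static double substitutionCost(char a, char b) {
        char x = Character.toLowerCase(a);
        char y = Character.toLowerCase(b);
        if (x == y) {
            return 0.0;
        }
        double[] px = POSITIONS.get(x);
        double[] py = POSITIONS.get(y);
        if (px == null || py == null) {
            return 1.0; // characters outside the layout get the plain Levenshtein cost
        }
        double keyDistance = Math.hypot(px[0] - py[0], px[1] - py[1]);
        return keyDistance < 1.5 ? 0.5 : 1.0; // hypothetical adjacency threshold
    }

    /** Two-row dynamic programme; insertions and deletions always cost 1. */
    static double distance(CharSequence left, CharSequence right) {
        int n = left.length();
        int m = right.length();
        double[] prev = new double[m + 1];
        double[] curr = new double[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= n; i++) {
            curr[0] = i;
            for (int j = 1; j <= m; j++) {
                double substitute = prev[j - 1]
                        + substitutionCost(left.charAt(i - 1), right.charAt(j - 1));
                double delete = prev[j] + 1.0;
                double insert = curr[j - 1] + 1.0;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            double[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[m];
    }
}
```

With these assumed costs, KeyboardWeightedLevenshtein.distance("one", "onr") gives 0.5
and distance("one", "oni") gives 1.0, so the adjacent-key typo from the issue description
ranks as closer. Other layouts would need their own key grid, which is why the layout
probably wants to be a pluggable constant rather than hard-coded as it is here.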

> Implement or document how to use edit distances that consider the keyboard layout
> ---------------------------------------------------------------------------------
>
>                 Key: TEXT-109
>                 URL: https://issues.apache.org/jira/browse/TEXT-109
>             Project: Commons Text
>          Issue Type: New Feature
>            Reporter: Bruno P. Kinoshita
>            Priority: Minor
>              Labels: discussion, edit-distance, help-wanted
>
> Most edit distances count the number of "changes" required to turn one string
into another, and return a value that represents the distance between
the words.
> While that is helpful, datasets and corpora that were created with
keyboards (e.g. SMS, e-mail, transcripts) commonly contain mistakes. In some cases a letter
was accidentally mistyped, but the character typed is normally close to the correct character.
> For example, take the word "one" and two misspellings, "onr" and "oni". The
Levenshtein distance for both is 1. But if you know that the keyboard is an English
QWERTY layout (notice the E and the R), the distance between "one" and
"onr" should be smaller than the distance between "one" and "oni", because on the English
keyboard the letter 'E' neighbours 'R', whereas 'I' is not even covered by the left hand,
but by the right hand.
> Here are some reference links for further research.
> * https://findsomethingnewtoday.wordpress.com/2013/07/20/986/
> * https://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
> * http://www.nada.kth.se/~ann/exjobb/axel_samuelsson.pdf
> * https://github.com/wsong/Typo-Distance
> * https://stackoverflow.com/questions/29233888/edit-distance-such-as-levenshtein-taking-into-account-proximity-on-keyboard
> Ideally such an edit distance would be extensible to support other keyboard layouts.
> There is some indication that an existing edit distance like Levenshtein could
be extended to take the keyboard layout into consideration. So perhaps a new edit distance
is not entirely necessary.
> We could come to the decision that it is too hard to implement and would be
better done in a spell checker, or that it would require some statistics and would be out
of the scope of Text. Or we could simply add documentation on how to do it, without adding
any code. Or, perhaps, we add a new edit distance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
