hive-dev mailing list archives

From "Alexander Pivovarov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-9556) create UDF to measure strings similarity using Levenshtein Distance algo
Date Thu, 12 Feb 2015 03:43:11 GMT

    [ https://issues.apache.org/jira/browse/HIVE-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317524#comment-14317524 ]

Alexander Pivovarov commented on HIVE-9556:
-------------------------------------------

String similarity functions can be used to detect fraudulent activity, e.g. a person who
registers several times with slightly different names: "Alexander" vs "Alexandre".
They can also be used to match addresses that are the same but written differently:
"110 Rock Harbor ln" vs "110 Rock harbour Lane".

Oracle has the SOUNDEX function for finding strings that sound similar.

Postgres has
- soundex
- difference
- levenshtein   // returns int instead of double
- metaphone
- dmetaphone

http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html

String similarity functions would be useful to people migrating from Oracle or Postgres
to Hive.
People who work with accounts, names, addresses, medical records, etc. are likely to find
string similarity functions extremely useful.
Data Scientists can use string similarity functions as well.

Levenshtein distance is already available in Apache Commons Lang as
StringUtils.getLevenshteinDistance(). Commons Lang is a standard library found in most
Java projects.

It would be nice to have Levenshtein distance in Hive as well.

> create UDF to measure strings similarity using Levenshtein Distance algo
> ------------------------------------------------------------------------
>
>                 Key: HIVE-9556
>                 URL: https://issues.apache.org/jira/browse/HIVE-9556
>             Project: Hive
>          Issue Type: Improvement
>          Components: UDF
>            Reporter: Alexander Pivovarov
>            Assignee: Alexander Pivovarov
>         Attachments: HIVE-9556.1.patch, HIVE-9556.2.patch
>
>
> algorithm description http://en.wikipedia.org/wiki/Levenshtein_distance
> {code}
> --one edit operation, greatest str len = 12
> str_sim_levenshtein('Test String1', 'Test String2') = 1 - 1 / 12 = 0.91666667
> {code}
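
The similarity formula in the description above (1 - distance / greatest string length)
can be sketched in plain Java as follows. This is only an illustration, not the code from
the attached patches: the class name StrSimLevenshtein, the method names, and the choice
to return 1.0 for two empty strings are assumptions made for this example. The distance
is computed by hand so the snippet has no dependencies; a real implementation could
call StringUtils.getLevenshteinDistance() from Commons Lang instead.

```java
// Hypothetical helper class; not taken from HIVE-9556.*.patch.
final class StrSimLevenshtein {

    // Levenshtein distance via dynamic programming, keeping only two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap rows so prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1 - distance / length of the longer string.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // assumption: two empty strings count as identical
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        // One edit, greatest length 12, so the result is 1 - 1/12.
        System.out.println(similarity("Test String1", "Test String2"));
    }
}
```

With this sketch "Alexander" vs "Alexandre" has distance 2, because plain Levenshtein
counts a transposition of two adjacent characters as two substitutions. NULL handling is
left out; how a Hive UDF should treat NULL arguments is not covered here.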



