hive-dev mailing list archives

From "Alexander Pivovarov (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-9559) Create UDF to measure strings similarity using q-gram distance algo
Date Tue, 03 Feb 2015 06:24:35 GMT

     [ https://issues.apache.org/jira/browse/HIVE-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Pivovarov updated HIVE-9559:
--------------------------------------
    Description: 
Algorithm description: http://stackoverflow.com/questions/1938678/q-gram-approximate-matching-optimisations

{code}
str_sim_qgrams('Test String1', 'Test String2') = 0.78571427f
{code}
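
As a rough check of the value above, here is a minimal Python sketch. It is not the proposed Hive UDF and not SimMetrics code. It assumes each string is padded with q-1 '#' characters on both sides, split into trigrams (q=3), and scored as (total q-grams - sum of absolute count differences) / total q-grams. The function names and the padding character are hypothetical.

```python
from collections import Counter


def padded_qgrams(s, q=3, pad="#"):
    """Count the q-grams of s after padding both ends with q-1 pad characters.

    The padding scheme is an assumption made for this sketch.
    """
    padded = pad * (q - 1) + s + pad * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_similarity(a, b, q=3):
    """Return a similarity in [0, 1]; 1.0 means identical q-gram profiles."""
    pa, pb = padded_qgrams(a, q), padded_qgrams(b, q)
    total = sum(pa.values()) + sum(pb.values())
    if total == 0:
        # Only reachable for two empty strings with q=1 (no padding).
        return 1.0
    # Counter returns 0 for q-grams missing from one of the profiles.
    diff = sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())
    return (total - diff) / total


print(qgram_similarity("Test String1", "Test String2"))
```

Under these assumptions each padded string yields 14 trigrams, of which 3 (those containing the final character) differ, so the score is (28 - 6) / 28, about 0.7857, in line with the expected value in the example.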

Another example, using qgrams and stringdist from the R stringdist package:
{code}
> qgrams('abcde','abdcde',q=2)
   ab bc cd de dc bd
V1  1  1  1  1  0  0
V2  1  0  1  1  1  1
 
> stringdist('abcde', 'abdcde', method='qgram', q=2)
[1] 3
{code}
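
The R output above can be reproduced with a short Python sketch: count the q-grams of each string, then sum the absolute differences of the counts. The helper names are hypothetical, and no padding is applied, which matches the table above.

```python
from collections import Counter


def qgram_profile(s, q=2):
    """Count every substring of length q in s (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a, b, q=2):
    """Sum of absolute differences between the two q-gram count vectors."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    # Counter returns 0 for q-grams that occur in only one string.
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


print(dict(qgram_profile("abcde")))      # {'ab': 1, 'bc': 1, 'cd': 1, 'de': 1}
print(qgram_distance("abcde", "abdcde"))  # 3
```

The three differing bigrams are bc (only in the first string) and dc and bd (only in the second), which gives the distance of 3.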

Take SimMetrics as a reference implementation:
https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/QGramsDistance.java
https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/QGramsDistanceTest.java

  was:
Algorithm description: http://stackoverflow.com/questions/1938678/q-gram-approximate-matching-optimisations

{code}
str_sim_qgrams("Test String1", "Test String2") = 0.78571427f
{code}

Another example, using qgrams and stringdist from the R stringdist package:
{code}
> qgrams('abcde','abdcde',q=2)
   ab bc cd de dc bd
V1  1  1  1  1  0  0
V2  1  0  1  1  1  1
 
> stringdist('abcde', 'abdcde', method='qgram', q=2)
[1] 3
{code}

Take SimMetrics as a reference implementation:
https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/QGramsDistance.java
https://github.com/Simmetrics/simmetrics/blob/master/src/uk/ac/shef/wit/simmetrics/similaritymetrics/QGramsDistanceTest.java


> Create UDF to measure strings similarity using q-gram distance algo
> -------------------------------------------------------------------
>
>                 Key: HIVE-9559
>                 URL: https://issues.apache.org/jira/browse/HIVE-9559
>             Project: Hive
>          Issue Type: Improvement
>          Components: UDF
>            Reporter: Alexander Pivovarov
>            Assignee: Alexander Pivovarov



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
