asterixdb-notifications mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Taewoo Kim (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (ASTERIXDB-1778) Calculating the edit-distance in SimilarityMetricEditDistance class can be improved.
Date Wed, 01 Feb 2017 06:09:52 GMT

    [ https://issues.apache.org/jira/browse/ASTERIXDB-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848036#comment-15848036
] 

Taewoo Kim edited comment on ASTERIXDB-1778 at 2/1/17 6:09 AM:
---------------------------------------------------------------

Got a hint from http://stackoverflow.com/questions/9453731/how-to-calculate-distance-similarity-measure-of-given-2-strings/9454016#9454016:

Here are the revised method:

Basic Algorithm:
For two strings X and Y, 

Init:
Construct D(len(X)+1=M, len(Y)+1=N).
D(i,0) = i
D(j,0) = j
T = Given Threshold

Iteration:
{code}
For each i = 1 to M
    MinDistance = INT.MAX_VALUE
    For each j = 1 to N:
        D(i,j) = min( D(i-1,j) + 1 // deletion case
                      D(i,j-1) + 1 // insertion case
                      D(i-1,j-1) + cost (cost is 1 if X( i ) != Y(j), 
                                                 cost is 0 if X( i ) = Y(j))
                    )
         If (D(i,j) < MinDistance) 
               MinDistance = D(i,j)
     // After processing each row,
     If (MinDistance > T)
         return -1
{code}
     
Result:                    
D(M,N) is the edit distance value. -1 is returned in case when it becomes obvious that the
final edit-distance will be greater than T.

Early Termination condition:
So, for the given threshold T, we can stop the computation early if the min value among all
possible values from each row is greater than the values of three cases are greater than T.



was (Author: wangsaeu):
Got a hint from http://stackoverflow.com/questions/9453731/how-to-calculate-distance-similarity-measure-of-given-2-strings/9454016#9454016:

Here are the revised method:

Basic Algorithm:
For two strings X and Y, 

Init:
Construct D(len(X)+1=M, len(Y)+1=N).
D(i,0) = i
D(j,0) = j
T = Given Threshold

Iteration:
For each i = 1 to M
    MinDistance = INT.MAX_VALUE
    For each j = 1 to N:
        D(i,j) = min( D(i-1,j) + 1 // deletion case
                      D(i,j-1) + 1 // insertion case
                      D(i-1,j-1) + cost (cost is 1 if X(i) != Y(j), 
                                                 cost is 0 if X(i) = Y(j))
                    )
         If (D(i,j) < MinDistance) 
               MinDistance = D(i,j)
     // After processing each row,
     If (MinDistance > T)
         return -1
     
Result:                    
D(M,N) is the edit distance value. -1 is returned in case when it becomes obvious that the
final edit-distance will be greater than T.

Early Termination condition:
So, for the given threshold T, we can stop the computation early if the min value among all
possible values from each row is greater than the values of three cases are greater than T.


> Calculating the edit-distance in SimilarityMetricEditDistance class can be improved.
> ------------------------------------------------------------------------------------
>
>                 Key: ASTERIXDB-1778
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1778
>             Project: Apache AsterixDB
>          Issue Type: Improvement
>            Reporter: Taewoo Kim
>            Priority: Minor
>
> Local optimization of the current edit-distance handling code in AsterixDB: Can we terminate
the edit distance calculation based on a given edit-distance threshold T? I think the answer
is yes.
> Basic Algorithm:
> For two strings X and Y, 
> Init:
> Construct D(len(X)+1=M, len(Y)+1=N).
> D(i,0) = i
> D(j,0) = j
> Iteration:
> For each i = 1 to M
>     For each j = 1 to N
>         D(i,j) = min( D(i-1,j) + 1 // deletion case
>                       D(i,j-1) + 1 // insertion case
>                       D(i-1,j-1) + cost (cost is 2 if X(i) != Y(j), 
>                                                  cost is 0 if X(i) = Y(j))
>                     )
> Result:                    
> D(M,N) is the edit distance value.
> Example:
> Early Termination condition:
> So, for the given threshold T, we can stop the computation early if the values of three
cases are greater than T. That is,
> min (D(i-1,j), D(i,j-1), D(i-1,j-1)) > T
> This holds since if the cost of all possible cases (insertion, deletion, and substitution)
is greater than T, all future operations will be greater than T in any cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message