[ https://issues.apache.org/jira/browse/ASTERIXDB1778?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=15848036#comment15848036
]
Taewoo Kim edited comment on ASTERIXDB1778 at 2/1/17 6:09 AM:

Got a hint from http://stackoverflow.com/questions/9453731/howtocalculatedistancesimilaritymeasureofgiven2strings/9454016#9454016:
Here are the revised method:
Basic Algorithm:
For two strings X and Y,
Init:
Construct D(len(X)+1=M, len(Y)+1=N).
D(i,0) = i
D(j,0) = j
T = Given Threshold
Iteration:
{code}
For each i = 1 to M
MinDistance = INT.MAX_VALUE
For each j = 1 to N:
D(i,j) = min( D(i1,j) + 1 // deletion case
D(i,j1) + 1 // insertion case
D(i1,j1) + cost (cost is 1 if X( i ) != Y(j),
cost is 0 if X( i ) = Y(j))
)
If (D(i,j) < MinDistance)
MinDistance = D(i,j)
// After processing each row,
If (MinDistance > T)
return 1
{code}
Result:
D(M,N) is the edit distance value. 1 is returned in case when it becomes obvious that the
final editdistance will be greater than T.
Early Termination condition:
So, for the given threshold T, we can stop the computation early if the min value among all
possible values from each row is greater than the values of three cases are greater than T.
was (Author: wangsaeu):
Got a hint from http://stackoverflow.com/questions/9453731/howtocalculatedistancesimilaritymeasureofgiven2strings/9454016#9454016:
Here are the revised method:
Basic Algorithm:
For two strings X and Y,
Init:
Construct D(len(X)+1=M, len(Y)+1=N).
D(i,0) = i
D(j,0) = j
T = Given Threshold
Iteration:
For each i = 1 to M
MinDistance = INT.MAX_VALUE
For each j = 1 to N:
D(i,j) = min( D(i1,j) + 1 // deletion case
D(i,j1) + 1 // insertion case
D(i1,j1) + cost (cost is 1 if X(i) != Y(j),
cost is 0 if X(i) = Y(j))
)
If (D(i,j) < MinDistance)
MinDistance = D(i,j)
// After processing each row,
If (MinDistance > T)
return 1
Result:
D(M,N) is the edit distance value. 1 is returned in case when it becomes obvious that the
final editdistance will be greater than T.
Early Termination condition:
So, for the given threshold T, we can stop the computation early if the min value among all
possible values from each row is greater than the values of three cases are greater than T.
> Calculating the editdistance in SimilarityMetricEditDistance class can be improved.
> 
>
> Key: ASTERIXDB1778
> URL: https://issues.apache.org/jira/browse/ASTERIXDB1778
> Project: Apache AsterixDB
> Issue Type: Improvement
> Reporter: Taewoo Kim
> Priority: Minor
>
> Local optimization of the current editdistance handling code in AsterixDB: Can we terminate
the edit distance calculation based on a given editdistance threshold T? I think the answer
is yes.
> Basic Algorithm:
> For two strings X and Y,
> Init:
> Construct D(len(X)+1=M, len(Y)+1=N).
> D(i,0) = i
> D(j,0) = j
> Iteration:
> For each i = 1 to M
> For each j = 1 to N
> D(i,j) = min( D(i1,j) + 1 // deletion case
> D(i,j1) + 1 // insertion case
> D(i1,j1) + cost (cost is 2 if X(i) != Y(j),
> cost is 0 if X(i) = Y(j))
> )
> Result:
> D(M,N) is the edit distance value.
> Example:
> Early Termination condition:
> So, for the given threshold T, we can stop the computation early if the values of three
cases are greater than T. That is,
> min (D(i1,j), D(i,j1), D(i1,j1)) > T
> This holds since if the cost of all possible cases (insertion, deletion, and substitution)
is greater than T, all future operations will be greater than T in any cases.

This message was sent by Atlassian JIRA
(v6.3.15#6346)
