[ https://issues.apache.org/jira/browse/ASTERIXDB1778?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Michael J. Carey updated ASTERIXDB1778:

Do we want that change, or should the optimized version be a differently
named/used function that's invoked under the hood?
> Calculating the editdistance in SimilarityMetricEditDistance class can be improved.
> 
>
> Key: ASTERIXDB1778
> URL: https://issues.apache.org/jira/browse/ASTERIXDB1778
> Project: Apache AsterixDB
> Issue Type: Improvement
> Reporter: Taewoo Kim
> Assignee: Taewoo Kim
> Priority: Minor
>
> Local optimization of the current editdistance handling code in AsterixDB: Can we terminate
the edit distance calculation based on a given editdistance threshold T? I think the answer
is yes.
> Basic Algorithm:
> For two strings X and Y,
> Init:
> Construct D(len(X)+1=M, len(Y)+1=N).
> D(i,0) = i
> D(j,0) = j
> Iteration:
> {code}
> For each i = 1 to M
> For each j = 1 to N
> D(i,j) = min( D(i1,j) + 1 // deletion case
> D(i,j1) + 1 // insertion case
> D(i1,j1) + cost (cost is 2 if X(i) != Y(j),
> cost is 0 if X(i) = Y(j))
> )
> {code}
> Result:
> D(M,N) is the edit distance value.
> Example:
> Early Termination condition:
> So, for the given threshold T, we can stop the computation early if the values of three
cases are greater than T. That is,
> min (D(i1,j), D(i,j1), D(i1,j1)) > T
> This holds since if the cost of all possible cases (insertion, deletion, and substitution)
is greater than T, all future operations will be greater than T in any cases.

This message was sent by Atlassian JIRA
(v6.3.15#6346)
