lucene-dev mailing list archives

From Bob Carpenter <>
Subject Re: Edit-distance strategy (slicing and one vs. all algorithms)
Date Thu, 08 Jun 2006 14:54:13 GMT
Here's a thoughtful discussion of Hirschberg's
edit-distance algorithm:

In general, it appears to be *slower* than Levenshtein, although
it only uses space proportional to the shorter string (vs. the
standard Levenshtein DP's product of the string lengths).

They conclude it's only worth it for the space savings, and only
when the strings are long.  That looks reasonable, but I have no
experience with this algorithm.

If only edit distance is needed and not alignment,
which is how I understand FuzzyQuery, then
you can also get along with space proportional
to the smaller of the strings by only allocating one
slice of the N x M dynamic programming table at
a time.  You can do this because each slice only
depends on the previous slice, with the boundary
conditions easily defined.  The resulting algorithm
looks a lot more like the well-known implementation
than Hirschberg's does.

You can re-use the slices, too, in a hand-over-hand
fashion, so you do less allocation and garbage collection.
You'll find a fairly clean Java implementation of this
technique in LingPipe's WeightedEditDistance class.
It even handles transposition and allows weights on the edits
(aka Smith-Waterman).
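
The slicing and hand-over-hand reuse described above might look
roughly like this -- a minimal sketch, not LingPipe's actual code,
assuming unit costs and no transposition:

```java
// Edit distance with only two reusable slices of the N x M table.
// The slices are swapped hand-over-hand each iteration, so there is
// no per-row allocation or garbage collection inside the loop.
public final class SlicedEditDistance {

    public static int distance(CharSequence a, CharSequence b) {
        // Make b the shorter string so the slices are proportional
        // to the smaller of the two lengths.
        if (a.length() < b.length()) {
            CharSequence t = a; a = b; b = t;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Boundary condition: distance from "" to b[0..j) is j deletes.
        for (int j = 0; j <= b.length(); j++)
            prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // boundary: i inserts to reach a[0..i)
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(sub,
                          Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev; prev = curr; curr = tmp; // hand-over-hand
        }
        return prev[b.length()];
    }
}
```

Each slice only depends on the previous one, so nothing else from
the table ever needs to be live.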

There are even more space efficient algorithms if you're
only looking to find things within a given edit distance.
The standard Levenshtein, etc., computes the edit distance --
it doesn't just answer "are they within distance k".
The trick here is to prune the search when the edits get
too large.  This requires all edit costs to be positive.
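
One way the pruning could look -- a hypothetical sketch, assuming
unit (positive) costs, restricting the table to a diagonal band of
width 2k+1 and bailing out as soon as a whole slice exceeds k:

```java
// Answers "are the strings within edit distance k?" rather than
// computing the full distance.  Cells outside the band |i - j| <= k
// can never be on a path of cost <= k, so they are pinned at k + 1.
public final class BoundedEditDistance {

    public static boolean withinDistance(CharSequence a, CharSequence b, int k) {
        if (Math.abs(a.length() - b.length()) > k)
            return false; // length gap alone already needs > k edits
        int big = k + 1;  // sentinel: "already over budget"
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++)
            prev[j] = (j <= k) ? j : big;
        for (int i = 1; i <= a.length(); i++) {
            int lo = Math.max(1, i - k);
            int hi = Math.min(b.length(), i + k);
            curr[0] = (i <= k) ? i : big;
            if (lo > 1)
                curr[lo - 1] = big; // seal the left edge of the band
            int rowMin = curr[0];
            for (int j = lo; j <= hi; j++) {
                int sub = prev[j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(big,
                          Math.min(sub,
                          Math.min(prev[j] + 1, curr[j - 1] + 1)));
                rowMin = Math.min(rowMin, curr[j]);
            }
            if (rowMin > k)
                return false; // prune: every partial edit is too large
            if (hi < b.length())
                curr[hi + 1] = big; // seal the right edge for the next row
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()] <= k;
    }
}
```

The early return is exactly the prune mentioned above, and it is
only sound because every edit cost is positive -- once a slice is
entirely over budget, no later slice can come back under it.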

The real gain would be to do something like the
edit-distance generalization of Aho-Corasick.  The
basic idea is that instead of n iterations of string vs. string,
you do one iteration of string vs. trie.  This improvement
makes a *huge* difference in run time for typical tries
composed of natural language terms.  It's not that
expensive to compute the trie, either -- it might even
be worth doing on the fly.  We have an implementation
in the LingPipe class ApproximateDictionaryChunker, which
is used to pull fuzzy matches against a dictionary (e.g. of
protein names, product names or transliterated names,
all of which have significant string variation) out of running text.
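
A toy sketch of the string-vs-trie idea (hypothetical code, not
LingPipe's ApproximateDictionaryChunker): each trie edge extends a
single DP slice, so a shared prefix is scored once for the whole
dictionary instead of once per term, and an entire subtrie is
pruned as soon as every cell in its slice exceeds k.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Fuzzy dictionary lookup: one pass of query vs. trie rather than
// n passes of query vs. string.  Assumes nonempty dictionary terms
// and unit edit costs.
public final class TrieFuzzyMatcher {

    private static final class Node {
        final Map<Character, Node> kids = new TreeMap<>();
        String word; // non-null iff a dictionary term ends here
    }

    private final Node root = new Node();

    public void add(String word) {
        Node n = root;
        for (char c : word.toCharArray())
            n = n.kids.computeIfAbsent(c, x -> new Node());
        n.word = word;
    }

    /** All dictionary terms within edit distance k of the query. */
    public List<String> search(String query, int k) {
        List<String> hits = new ArrayList<>();
        int[] row = new int[query.length() + 1];
        for (int j = 0; j <= query.length(); j++)
            row[j] = j; // boundary slice for the empty prefix
        for (Map.Entry<Character, Node> e : root.kids.entrySet())
            walk(e.getValue(), e.getKey(), query, row, k, hits);
        return hits;
    }

    // Extend the parent's DP slice by one trie character c.
    private void walk(Node n, char c, String q, int[] prev,
                      int k, List<String> hits) {
        int[] curr = new int[q.length() + 1];
        curr[0] = prev[0] + 1;
        int rowMin = curr[0];
        for (int j = 1; j <= q.length(); j++) {
            int sub = prev[j - 1] + (q.charAt(j - 1) == c ? 0 : 1);
            curr[j] = Math.min(sub,
                      Math.min(prev[j] + 1, curr[j - 1] + 1));
            rowMin = Math.min(rowMin, curr[j]);
        }
        if (n.word != null && curr[q.length()] <= k)
            hits.add(n.word);
        if (rowMin <= k) // otherwise prune the whole subtrie
            for (Map.Entry<Character, Node> e : n.kids.entrySet())
                walk(e.getValue(), e.getKey(), q, curr, k, hits);
    }
}
```

The win is exactly the one described above: for a dictionary of
natural-language terms, heavy prefix sharing plus subtrie pruning
means most of the trie is never visited at all.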

There's an awesome book covering all of these algorithms:

Dan Gusfield.  Algorithms on Strings, Trees, and Sequences.
Cambridge University Press.

Chapter 12 is all about approximate matching of sets of
strings vs. a string using edit distance.  You'll have to learn
their notation for suffix trees along the way, though you
won't need full suffix trees for matching against a dictionary.

A bonus is that you get to learn about really cool biology
sequencing applications and the *sub-linear* matching
possible with exclusion techniques (mind-bogglingly
clever algorithms).

- Bob Carpenter

eks dev wrote:
> What could show measurably faster results is more in the Jaro
> distance direction, but even then, the fact that you need to scan
> all terms in the dictionary will make it prohibitive for large
> collections. For small collections this can be packed in some
> smart data structures to restrict the number of distance
> calculations...
> Good luck with it
>> I'm about to replace the edit-distance algorithm in FuzzyQuery from
>> Levenshtein to Hirschberg to save a couple of clock ticks.
> Have you already confirmed the Hirschberg algorithm to be faster
> than the current implementation of edit distance? I am not convinced
> it really helps. Hirschberg and the standard DP algorithm both have
> O(|s1|*|s2|) time. The only saving is that Hirschberg needs O(|s2|)
> space using binary recursion (the classic algorithm needs
> O(|s1|*|s2|) space).
