Here's a thoughtful discussion of Hirschberg's edit-distance algorithm:

http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Hirsch/

In general, it appears to be *slower* than Levenshtein, although it only uses space proportional to the shorter string (vs. Levenshtein's product of the string lengths). They conclude it's only useful for saving space when the strings are long. Looks reasonable, but I have no experience with this algorithm.

If only the edit distance is needed and not the alignment, which is how I understand FuzzyQuery, then you can also get by with space proportional to the shorter string by allocating only one slice of the N x M dynamic programming table at a time. This works because each slice depends only on the previous slice, with the boundary conditions easily defined. The resulting algorithm looks much more like the well-known implementation than Hirschberg's does. You can re-use the slices, too, in hand-over-hand fashion, so you do less allocation and garbage collection.

You'll find a fairly clean Java implementation of this technique in LingPipe's WeightedEditDistance class (http://www.alias-i.com/lingpipe). It even handles transposition and allows weights on the edits (aka Smith-Waterman).

There are even more space-efficient algorithms if you're only looking to find things within a given edit distance. The standard Levenshtein, etc., computes the edit distance -- it doesn't just answer "are they within distance k?". The trick here is to prune the search as soon as the edits get too large; this requires all edit costs to be positive.

The real gain would be to do something like the edit-distance generalization of Aho-Corasick. The basic idea is that instead of n iterations of string vs. string, you do one iteration of string vs. trie. This improvement makes a *huge* difference in run time for typical tries composed of natural-language terms. It's not that expensive to compute the trie, either -- it might even be worth doing on the fly.
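The one-slice trick is short enough to sketch. Here's a minimal two-row version in Java -- class and method names are my own for illustration, not LingPipe's API:

```java
// Two-slice ("hand-over-hand") Levenshtein distance: only two rows of the
// N x M dynamic programming table are live at any time, so space is
// O(min(|s1|, |s2|)) while time stays O(|s1| * |s2|).
final class TwoRowEditDistance {

    static int distance(CharSequence s1, CharSequence s2) {
        // Iterate over the longer string; allocate rows for the shorter one.
        if (s1.length() < s2.length()) { CharSequence t = s1; s1 = s2; s2 = t; }
        int m = s2.length();
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;    // boundary: edits from ""
        for (int i = 1; i <= s1.length(); i++) {
            curr[0] = i;                             // boundary: i deletions
            for (int j = 1; j <= m; j++) {
                int sub = prev[j - 1]
                        + (s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] t = prev; prev = curr; curr = t;   // hand-over-hand reuse
        }
        return prev[m];
    }
}
```

Note the row swap at the bottom of the outer loop: the two arrays are recycled rather than reallocated per slice, which is exactly the allocation/GC saving mentioned above.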
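To make the string-vs-trie idea concrete, here's a toy sketch -- this is my own illustration, not LingPipe's implementation. The point is that dictionary terms sharing a prefix share the DP rows for that prefix, so one depth-first walk of the trie replaces n separate string-vs-string runs, and whole subtries get pruned when every cell in a row exceeds k:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Fuzzy dictionary lookup over a trie: one Levenshtein row per trie node,
// computed from the parent's row, with subtrie pruning when the row
// minimum exceeds the bound k.
final class TrieFuzzyMatch {

    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String term;                        // non-null iff a term ends here
    }

    final Node root = new Node();

    void add(String term) {
        Node n = root;
        for (char c : term.toCharArray())
            n = n.children.computeIfAbsent(c, k -> new Node());
        n.term = term;
    }

    /** All dictionary terms within edit distance k of the query. */
    List<String> search(String query, int k) {
        List<String> hits = new ArrayList<>();
        int[] row = new int[query.length() + 1];
        for (int j = 0; j <= query.length(); j++) row[j] = j;
        for (Map.Entry<Character, Node> e : root.children.entrySet())
            search(e.getValue(), e.getKey(), query, row, k, hits);
        return hits;
    }

    private void search(Node node, char c, String query,
                        int[] prev, int k, List<String> hits) {
        int[] curr = new int[query.length() + 1];
        curr[0] = prev[0] + 1;
        int rowMin = curr[0];
        for (int j = 1; j <= query.length(); j++) {
            int sub = prev[j - 1] + (query.charAt(j - 1) == c ? 0 : 1);
            curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
            rowMin = Math.min(rowMin, curr[j]);
        }
        if (node.term != null && curr[query.length()] <= k)
            hits.add(node.term);
        if (rowMin <= k)                    // otherwise prune whole subtrie
            for (Map.Entry<Character, Node> e : node.children.entrySet())
                search(e.getValue(), e.getKey(), query, curr, k, hits);
    }
}
```

For natural-language dictionaries the shared prefixes and the subtrie pruning together are what produce the big run-time win over n independent comparisons.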
We have an implementation in the LingPipe class ApproximateDictionaryChunker, which is used to pull fuzzy matches against a dictionary (e.g. of protein names, product names, or transliterated names, all of which show significant string variation) out of running text.

There's an awesome book covering all of these algorithms:

Dan Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press.

Chapter 12 is all about approximate matching of sets of strings against a string using edit distance. You'll have to learn his notation for suffix trees along the way, though you won't need full suffix trees for matching against a dictionary. A bonus is that you get to learn about really cool biology sequencing applications and the *sub-linear* matching possible with exclusion techniques (mind-bogglingly clever algorithms).

- Bob Carpenter
http://www.colloquial.com/carp
Alias-i
http://www.alias-i.com/lingpipe

eks dev wrote:
> What could show measurably faster results is more in the Jaro distance direction, but even then, the fact that you need to scan all terms in the dictionary will make it prohibitive for large collections. For small collections this can be packed into some smart data structures to restrict the number of distance calculations...
>
> Good luck with it.
>
>> I'm about to replace the edit-distance algorithm in FuzzyQuery from
>> Levenshtein to Hirschberg to save a couple of clock ticks.
>
> Have you already confirmed the Hirschberg algorithm to be faster than the current implementation of edit distance? I am not convinced it really helps. Hirschberg and the standard DP algorithm both take O(|s1|*|s2|) time. The only saving is that Hirschberg needs O(|s2|) space using binary recursion (the classic algorithm needs O(|s1|*|s2|) space).

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org