lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mark harwood <markharw...@yahoo.co.uk>
Subject Re: Edit-distance strategy
Date Thu, 08 Jun 2006 15:12:03 GMT
FWIW,  I integrated sourceforge's "SecondString" algos (http://secondstring.sourceforge.net/javadoc
) and others using a callout interface which boiled down to:

  float getDifference(String a, String b)

This seemed to be the cleanest lowest-common-denominator standard for plugging in string comparison
algos where each implementation produces a score between 0 and 1. 
I didn't find anything in the algorithm alternatives which was particularly faster or better
than the existing edit distance impl but I don't think it hurts to enable pluggability. 

On a related topic - if effort is being spent refactoring FuzzyQuery I would re-raise the
more pressing concern that the default IDF ranking in here is broken because it favours rare
terms (ie misspellings).

See contrib/queries/....FuzzyLikeThisQuery for alternative scoring strategies or the thread
here: http://www.mail-archive.com/java-user@lucene.apache.org/msg05811.html


Cheers
Mark



----- Original Message ----
From: eks dev <eksdev@yahoo.co.uk>
To: java-dev@lucene.apache.org
Sent: Thursday, 8 June, 2006 3:12:04 PM
Subject: Re: Edit-distance strategy

what could show measurably faster results is more in Jaro distance direction, but even than,
the fact that you need to scan all terms in dictionary will make it prohibitive for large
colletions. For small collections this can be packed in some smart data structures to restrict
number of distance calculations...
good luck with it

----- Original Message ----
From: eks dev <eksdev@yahoo.co.uk>
To: java-dev@lucene.apache.org
Sent: Thursday, 8 June, 2006 4:07:07 PM
Subject: Re: Edit-distance strategy


>I'm about to replace the edit-distance algorithm in FuzzyQuery from
>Levenstein to Hirschberg to save a couple of clockticks.

Have you allready confirmed Hirschberg algorithm to be faster than current implementation
of edit distance? I am not convinced it helps really. Hirschberg and standard DP algorithm
both have O(|s1|*|s2|) time. The only saving is that Hirschberg needs O(|s2|) space using
binary recursion (classic needs O(|s1|*|s2|) space). 




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message