lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From <karl.wri...@Nokia.com>
Subject RE: LevenshteinFilter proposal
Date Mon, 26 Jul 2010 15:30:14 GMT
It occurs to me that the proper static initializer code might well be able to generate distances
of 3, 4 or whatever, without bloating the jar.

Nevertheless, the real question of import to me right now is: what “minimumDistance” value
corresponds to a Levenshtein distance of 1?  2?  The maximum “minimumDistance” the query
accepts is 1.0.

Karl


From: ext Robert Muir [mailto:rcmuir@gmail.com]
Sent: Friday, July 23, 2010 11:22 AM
To: dev@lucene.apache.org
Subject: Re: LevenshteinFilter proposal

Well, there are two main things involved:
1. number of terms seeked to in the term enum
2. expense of the comparison itself.

one challenge is the construction of a DFA LevK(x) that recognizes all words within edit distance
<= k of x is an expensive operation.
This is because of the nature of the computation (traditionally its dynamic programming with
nasty runtime).

We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to offline the nasty
part so its linear time: O(n) where n is the length of the query term.
But these offline-computated tables get pretty large quick, and so we only include 1,2 to
keep the lucene jar file down.

additionally for n > 2 it starts to be a ton of termenum seeks... at some of these high
distances its faster to just next() thru the terms and compare.

we might be able to speed it up more by including say a table for n=3 (at the expense of increased
jar size), but not using this n=3 table for #1, only to speed up #2. but its diminishing returns
here...

On Fri, Jul 23, 2010 at 11:09 AM, <karl.wright@nokia.com<mailto:karl.wright@nokia.com>>
wrote:
Glad I asked.

I would think that the automaton would be superior even for larger edit distances than 1 or
2 than the equivalent “crappy” algorithm.  But maybe I don’t understand something. ;-)

Karl


From: ext Robert Muir [mailto:rcmuir@gmail.com<mailto:rcmuir@gmail.com>]
Sent: Friday, July 23, 2010 11:05 AM

To: dev@lucene.apache.org<mailto:dev@lucene.apache.org>
Subject: Re: LevenshteinFilter proposal

this is actually done in trunk.

In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses automaton.

for higher distances it uses the crappy "brute force" method.
but, higher distances still get accelerated if you use a reasonable 'maxExpansions' to FuzzyQuery...
the default is quite bad (1024).


On Fri, Jul 23, 2010 at 10:59 AM, <karl.wright@nokia.com<mailto:karl.wright@nokia.com>>
wrote:
Thanks!

FuzzyQuery will do for my purposes, for the interim.  But I suspect that FuzzyQuery could
be made a lot more efficient if it were rebuilt on top of Automaton, no?  I understand that
this would be a trunk project.

Karl


From: ext Uwe Schindler [mailto:uwe@thetaphi.de<mailto:uwe@thetaphi.de>]
Sent: Friday, July 23, 2010 10:45 AM

To: dev@lucene.apache.org<mailto:dev@lucene.apache.org>
Subject: RE: LevenshteinFilter proposal

Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery, use MultiTermQueryWrapperFilter(new
FuzzyQuery(…))

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de<http://www.thetaphi.de/>
eMail: uwe@thetaphi.de<mailto:uwe@thetaphi.de>

From: karl.wright@nokia.com<mailto:karl.wright@nokia.com> [mailto:karl.wright@nokia.com<mailto:karl.wright@nokia.com>]
Sent: Friday, July 23, 2010 4:25 PM
To: dev@lucene.apache.org<mailto:dev@lucene.apache.org>
Subject: LevenshteinFilter proposal

Hi Folks,

I’m very interested in using (or developing!) a Levenshtein Filter within the family of
Solr Filter objects. I don’t see such a class today anywhere. I see how the AutomatonQuery
object would permit such a thing to be built, but to date I don’t know of anyone who has
built one. Do you?  If not, I’m willing to give it a whirl.  Also, AutomatonQuery doesn’t
seem to come up when I look for it in the javadocs for Lucene – can you point me in the
correct direction?
Thanks!
Karl





--
Robert Muir
rcmuir@gmail.com<mailto:rcmuir@gmail.com>



--
Robert Muir
rcmuir@gmail.com<mailto:rcmuir@gmail.com>
Mime
View raw message