lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: LevenshteinFilter proposal
Date Mon, 26 Jul 2010 15:39:19 GMT
On Mon, Jul 26, 2010 at 11:35 AM, Robert Muir <rcmuir@gmail.com> wrote:
> On Mon, Jul 26, 2010 at 11:30 AM, <karl.wright@nokia.com> wrote:
>>
>> It occurs to me that the proper static initializer code might well be able
>> to generate distances of 3, 4 or whatever, without bloating the jar.
>
> I disagree: generating these distances is a non-trivial algorithm and takes
> minutes or hours to compute...

Furthermore, our own implementation for this algo is in Python :)

>> Nevertheless, the real question of import to me right now is: what
>> “minimumDistance” value corresponds to a Levenshtein distance of 1?  2?  The
>> maximum “minimumDistance” the query accepts is 1.0.
>
> right, fuzzyquery uses a strange formula that doesnt tie directly to edit
> distance (unless you factor in length).
> you can do this easier like this:
> LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
> Automaton ed2 = builder.toAutomaton(2);
> // the term doesnt really matter, just for the field name and toString, etc.
> AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
> ed2);
>
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:rcmuir@gmail.com]
>> Sent: Friday, July 23, 2010 11:22 AM
>>
>> To: dev@lucene.apache.org
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> Well, there are two main things involved:
>>
>> 1. number of terms seeked to in the term enum
>>
>> 2. expense of the comparison itself.
>>
>>
>>
>> one challenge is the construction of a DFA LevK(x) that recognizes all
>> words within edit distance <= k of x is an expensive operation.
>>
>> This is because of the nature of the computation (traditionally its
>> dynamic programming with nasty runtime).
>>
>>
>>
>> We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to
>> offline the nasty part so its linear time: O(n) where n is the length of the
>> query term.
>>
>> But these offline-computated tables get pretty large quick, and so we only
>> include 1,2 to keep the lucene jar file down.
>>
>>
>>
>> additionally for n > 2 it starts to be a ton of termenum seeks... at some
>> of these high distances its faster to just next() thru the terms and
>> compare.
>>
>>
>>
>> we might be able to speed it up more by including say a table for n=3 (at
>> the expense of increased jar size), but not using this n=3 table for #1,
>> only to speed up #2. but its diminishing returns here...
>>
>>
>>
>> On Fri, Jul 23, 2010 at 11:09 AM, <karl.wright@nokia.com> wrote:
>>
>> Glad I asked.
>>
>>
>>
>> I would think that the automaton would be superior even for larger edit
>> distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe I
>> don’t understand something. ;-)
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:rcmuir@gmail.com]
>> Sent: Friday, July 23, 2010 11:05 AM
>>
>> To: dev@lucene.apache.org
>>
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> this is actually done in trunk.
>>
>>
>>
>> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
>> automaton.
>>
>>
>>
>> for higher distances it uses the crappy "brute force" method.
>>
>> but, higher distances still get accelerated if you use a reasonable
>> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
>>
>>
>>
>>
>>
>> On Fri, Jul 23, 2010 at 10:59 AM, <karl.wright@nokia.com> wrote:
>>
>> Thanks!
>>
>>
>>
>> FuzzyQuery will do for my purposes, for the interim.  But I suspect that
>> FuzzyQuery could be made a lot more efficient if it were rebuilt on top of
>> Automaton, no?  I understand that this would be a trunk project.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Uwe Schindler [mailto:uwe@thetaphi.de]
>> Sent: Friday, July 23, 2010 10:45 AM
>>
>> To: dev@lucene.apache.org
>>
>> Subject: RE: LevenshteinFilter proposal
>>
>>
>>
>> Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery,
>> use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
>>
>>
>>
>> -----
>>
>> Uwe Schindler
>>
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>
>> http://www.thetaphi.de
>>
>> eMail: uwe@thetaphi.de
>>
>>
>>
>> From: karl.wright@nokia.com [mailto:karl.wright@nokia.com]
>> Sent: Friday, July 23, 2010 4:25 PM
>> To: dev@lucene.apache.org
>> Subject: LevenshteinFilter proposal
>>
>>
>>
>> Hi Folks,
>>
>> I’m very interested in using (or developing!) a Levenshtein Filter within
>> the family of Solr Filter objects. I don’t see such a class today anywhere.
>> I see how the AutomatonQuery object would permit such a thing to be built,
>> but to date I don’t know of anyone who has built one. Do you?  If not, I’m
>> willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come up
>> when I look for it in the javadocs for Lucene – can you point me in the
>> correct direction?
>>
>> Thanks!
>> Karl
>>
>>
>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message