lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From <karl.wri...@nokia.com>
Subject RE: LevenshteinFilter proposal
Date Mon, 26 Jul 2010 16:51:19 GMT
Is the automaton creation code checked in anywhere?
Also, if it’s not possible to generate the automaton table on the fly, perhaps it could
be included in an optional jar (whose existence would be queried by reflection).  The reason
I’m interested in LD’s of three, especially, is because I can well imagine phonetic misspellings
that would have distances on that order.

Karl


From: ext Robert Muir [mailto:rcmuir@gmail.com]
Sent: Monday, July 26, 2010 11:49 AM
To: dev@lucene.apache.org
Subject: Re: LevenshteinFilter proposal

I wanted to show you what i mean, just so you know:

here is what 'ant createLevAutomata' does now:
createLevAutomata:
     [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
     [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]

BUILD SUCCESSFUL
Total time: 4 seconds

I modified it with the patch below to generate 3 and 4, just for kicks, and ran it again...
it took it 7 minutes to generate lev3, its still running on lev4 (might take hours or days,
i dont actually know)

createLevAutomata:
     [exec] Wrote Lev1ParametricDescription.java [102 lines; 3.7 KB]
     [exec] Wrote Lev2ParametricDescription.java [147 lines; 9.6 KB]
     [exec] Wrote Lev3ParametricDescription.java [333 lines; 167.2 KB]
...

Index: build.xml
===================================================================
--- build.xml   (revision 959288)
+++ build.xml   (working copy)
@@ -533,6 +533,8 @@
   <target name="createLevAutomata" depends="check-moman,clone-moman,pull-moman">
        <createLevAutomaton n="1"/>
        <createLevAutomaton n="2"/>
+       <createLevAutomaton n="3"/>
+       <createLevAutomaton n="4"/>
   </target>

   <target name="check-moman">

On Mon, Jul 26, 2010 at 11:39 AM, Michael McCandless <lucene@mikemccandless.com<mailto:lucene@mikemccandless.com>>
wrote:
On Mon, Jul 26, 2010 at 11:35 AM, Robert Muir <rcmuir@gmail.com<mailto:rcmuir@gmail.com>>
wrote:
> On Mon, Jul 26, 2010 at 11:30 AM, <karl.wright@nokia.com<mailto:karl.wright@nokia.com>>
wrote:
>>
>> It occurs to me that the proper static initializer code might well be able
>> to generate distances of 3, 4 or whatever, without bloating the jar.
>
> I disagree: generating these distances is a non-trivial algorithm and takes
> minutes or hours to compute...
Furthermore, our own implementation for this algo is in Python :)

>> Nevertheless, the real question of import to me right now is: what
>> “minimumDistance” value corresponds to a Levenshtein distance of 1?  2?  The
>> maximum “minimumDistance” the query accepts is 1.0.
>
> right, fuzzyquery uses a strange formula that doesnt tie directly to edit
> distance (unless you factor in length).
> you can do this easier like this:
> LevenshteinAutomata builder = new LevenshteinAutomata("foobar");
> Automaton ed2 = builder.toAutomaton(2);
> // the term doesnt really matter, just for the field name and toString, etc.
> AutomatonQuery query = new AutomatonQuery(new Term("field", "foobar~2"),
> ed2);
>
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:rcmuir@gmail.com<mailto:rcmuir@gmail.com>]
>> Sent: Friday, July 23, 2010 11:22 AM
>>
>> To: dev@lucene.apache.org<mailto:dev@lucene.apache.org>
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> Well, there are two main things involved:
>>
>> 1. number of terms seeked to in the term enum
>>
>> 2. expense of the comparison itself.
>>
>>
>>
>> one challenge is the construction of a DFA LevK(x) that recognizes all
>> words within edit distance <= k of x is an expensive operation.
>>
>> This is because of the nature of the computation (traditionally its
>> dynamic programming with nasty runtime).
>>
>>
>>
>> We use http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 to
>> offline the nasty part so its linear time: O(n) where n is the length of the
>> query term.
>>
>> But these offline-computated tables get pretty large quick, and so we only
>> include 1,2 to keep the lucene jar file down.
>>
>>
>>
>> additionally for n > 2 it starts to be a ton of termenum seeks... at some
>> of these high distances its faster to just next() thru the terms and
>> compare.
>>
>>
>>
>> we might be able to speed it up more by including say a table for n=3 (at
>> the expense of increased jar size), but not using this n=3 table for #1,
>> only to speed up #2. but its diminishing returns here...
>>
>>
>>
>> On Fri, Jul 23, 2010 at 11:09 AM, <karl.wright@nokia.com<mailto:karl.wright@nokia.com>>
wrote:
>>
>> Glad I asked.
>>
>>
>>
>> I would think that the automaton would be superior even for larger edit
>> distances than 1 or 2 than the equivalent “crappy” algorithm.  But maybe I
>> don’t understand something. ;-)
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Robert Muir [mailto:rcmuir@gmail.com<mailto:rcmuir@gmail.com>]
>> Sent: Friday, July 23, 2010 11:05 AM
>>
>> To: dev@lucene.apache.org<mailto:dev@lucene.apache.org>
>>
>> Subject: Re: LevenshteinFilter proposal
>>
>>
>>
>> this is actually done in trunk.
>>
>>
>>
>> In trunk fuzzy's enum is a "proxy". for low distances (ed=1,2) it uses
>> automaton.
>>
>>
>>
>> for higher distances it uses the crappy "brute force" method.
>>
>> but, higher distances still get accelerated if you use a reasonable
>> 'maxExpansions' to FuzzyQuery... the default is quite bad (1024).
>>
>>
>>
>>
>>
>> On Fri, Jul 23, 2010 at 10:59 AM, <karl.wright@nokia.com<mailto:karl.wright@nokia.com>>
wrote:
>>
>> Thanks!
>>
>>
>>
>> FuzzyQuery will do for my purposes, for the interim.  But I suspect that
>> FuzzyQuery could be made a lot more efficient if it were rebuilt on top of
>> Automaton, no?  I understand that this would be a trunk project.
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> From: ext Uwe Schindler [mailto:uwe@thetaphi.de<mailto:uwe@thetaphi.de>]
>> Sent: Friday, July 23, 2010 10:45 AM
>>
>> To: dev@lucene.apache.org<mailto:dev@lucene.apache.org>
>>
>> Subject: RE: LevenshteinFilter proposal
>>
>>
>>
>> Automaton is only in Lucene/Solr Trunk. To get a filter out of FuzzyQuery,
>> use MultiTermQueryWrapperFilter(new FuzzyQuery(…))
>>
>>
>>
>> -----
>>
>> Uwe Schindler
>>
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>
>> http://www.thetaphi.de
>>
>> eMail: uwe@thetaphi.de<mailto:uwe@thetaphi.de>
>>
>>
>>
>> From: karl.wright@nokia.com<mailto:karl.wright@nokia.com> [mailto:karl.wright@nokia.com<mailto:karl.wright@nokia.com>]
>> Sent: Friday, July 23, 2010 4:25 PM
>> To: dev@lucene.apache.org<mailto:dev@lucene.apache.org>
>> Subject: LevenshteinFilter proposal
>>
>>
>>
>> Hi Folks,
>>
>> I’m very interested in using (or developing!) a Levenshtein Filter within
>> the family of Solr Filter objects. I don’t see such a class today anywhere.
>> I see how the AutomatonQuery object would permit such a thing to be built,
>> but to date I don’t know of anyone who has built one. Do you?  If not, I’m
>> willing to give it a whirl.  Also, AutomatonQuery doesn’t seem to come up
>> when I look for it in the javadocs for Lucene – can you point me in the
>> correct direction?
>>
>> Thanks!
>> Karl
>>
>>
>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com<mailto:rcmuir@gmail.com>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com<mailto:rcmuir@gmail.com>
>
>
> --
> Robert Muir
> rcmuir@gmail.com<mailto:rcmuir@gmail.com>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org<mailto:dev-unsubscribe@lucene.apache.org>
For additional commands, e-mail: dev-help@lucene.apache.org<mailto:dev-help@lucene.apache.org>



--
Robert Muir
rcmuir@gmail.com<mailto:rcmuir@gmail.com>
Mime
View raw message