lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Fuad Efendi" <f...@efendi.ca>
Subject RE: [jira] Commented: (LUCENE-2089) explore using automaton for fuzzyquery
Date Thu, 11 Feb 2010 14:53:33 GMT

Hi Eks Dev, thanks for the comment!

I like this:
> If you need too use this, nothing simpler, you do not even need pair
> comparison (aka traversal), just Index terms split into bigrams and
> search with standard Query.


And, I need to study all the tricks of Automaton...


-Fuad




> -----Original Message-----
> From: Eks Dev (JIRA) [mailto:jira@apache.org]
> Sent: February-11-10 5:04 AM
> To: java-dev@lucene.apache.org
> Subject: [jira] Commented: (LUCENE-2089) explore using automaton for
> fuzzyquery
> 
> 
>     [ https://issues.apache.org/jira/browse/LUCENE-
> 2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
> tabpanel&focusedCommentId=12832424#action_12832424 ]
> 
> Eks Dev commented on LUCENE-2089:
> ---------------------------------
> 
> {quote}
>  What about this,
> http://www.catalysoft.com/articles/StrikeAMatch.html
> it seems logically more appropriate to (human-entered) text objects than
> Levenshtein distance, and it is (in theory) extremely fast; is DFA-
> distance faster?
> {quote}
> 
> Is that only me who sees plain, vanilla bigram distance here? What is
> new or better in StrikeAMatch compared to the first phase of the current
> SpellCehcker (feeding PriorityQueue with candidates)?
> 
> If you need too use this, nothing simpler, you do not even need pair
> comparison (aka traversal), just Index terms split into bigrams and
> search with standard Query.
> 
> 
> Autmaton trick is a neat one. Imo,  the only thing that would work
> better is to make term dictionary real trie (ternary, n-ary, dfa, makes
> no big diff). Making TerrmDict some sort of trie/dfa would permit smart
> beam-search,  even without compiling query DFA. Beam search also makes
> implementation of better distances possible (Weighted Edit distance
> without "metric constraint" ). I guess this is going to be possible with
> Flex, Mike was allready talking about DFA Dictionary :)
> 
> It took a while to figure out the trick Robert pooled here, treating
> term dictionary as another DFA due to the sortedness, nice.
> 
> > explore using automaton for fuzzyquery
> > --------------------------------------
> >
> >                 Key: LUCENE-2089
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-2089
> >             Project: Lucene - Java
> >          Issue Type: Wish
> >          Components: Search
> >            Reporter: Robert Muir
> >            Assignee: Mark Miller
> >            Priority: Minor
> >         Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz,
> TestFuzzy.java
> >
> >
> > Mark brought this up on LUCENE-1606 (i will assign this to him, I know
> he is itching to write that nasty algorithm)
> > we can optimize fuzzyquery by using AutomatonTermsEnum, here is my
> idea
> > * up front, calculate the maximum required K edits needed to match the
> users supplied float threshold.
> > * for at least small common E up to some max K (1,2,3, etc) we should
> create a DFA for each E.
> > if the required E is above our supported max, we use "dumb mode" at
> first (no seeking, no DFA, just brute force like now).
> > As the pq fills, we swap progressively lower DFAs into the enum, based
> upon the lowest score in the pq.
> > This should work well on avg, at high E, you will typically fill the
> pq very quickly since you will match many terms.
> > This not only provides a mechanism to switch to more efficient DFAs
> during enumeration, but also to switch from "dumb mode" to "smart mode".
> > i modified my wildcard benchmark to generate random fuzzy queries.
> > * Pattern: 7N stands for NNNNNNN, etc.
> > * AvgMS_DFA: this is the time spent creating the automaton
> (constructor)
> > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
> > |7N|10|64.0|4155.9|38.6|20.3|
> > |14N|10|0.0|2511.6|46.0|37.9|
> > |28N|10|0.0|2506.3|93.0|86.6|
> > |56N|10|0.0|2524.5|304.4|298.5|
> > as you can see, this prototype is no good yet, because it creates the
> DFA in a slow way. right now it creates an NFA, and all this wasted time
> is in NFA->DFA conversion.
> > So, for a very long string, it just gets worse and worse. This has
> nothing to do with lucene, and here you can see, the TermEnum is fast
> (AvgMS - AvgMS_DFA), there is no problem there.
> > instead we should just build a DFA to begin with, maybe with this
> paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
> > we can precompute the tables with that algorithm up to some reasonable
> K, and then I think we are ok.
> > the paper references using
> http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if
> someone wants to implement this they should not worry about
> minimization.
> > in fact, we need to at some point determine if AutomatonQuery should
> even minimize FSM's at all, or if it is simply enough for them to be
> deterministic with no transitions to dead states. (The only code that
> actually assumes minimal DFA is the "Dumb" vs "Smart" heuristic and this
> can be rewritten as a summation easily). we need to benchmark really
> complex DFAs (i.e. write a regex benchmark) to figure out if
> minimization is even helping right now.
> 
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message