lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From eks dev <eks...@googlemail.com>
Subject Re: LevenshteinAutomata challenge
Date Wed, 10 Aug 2011 11:22:03 GMT
Thanks David,
That would be super duper great!

<Regex1><LEV><Regex2>...

the trick with "smart skipping" terms enum is not changing with
automaton construction? These two steps are independent... if you
build infinite Automaton like REgex(".*"), your problem :)

This cannot be better at the moment.  One day, TermDict will become
automaton as well... but for now we live with this brilliant idea to
"use sorted list of terms as other automaton for matching" . But even
than, REgex(".*") should match everything :)


My proposal is about performance and flexibility
Compare simple use case:

in order to support transposition in the first two characters like in
my example, you would need Lev. Automaton that has maxDistance 2, so
it would waste time to scan *much much more* than needed for given
constraint. It does not matter if you match against "sorted list" or
another automaton.

Alternative would be to run two fuzzy enums with Lev automata with
maxDistance = 1.  and filter them out on prefix later. But these are
two separate "term enum scans"

With concatenation, "composite automaton" would scan prefix part like
normal finite regex prefixes (fast!) and the rest with Lev. automaton
with maxDistance 1, using whatever is at the moment possible (sorted
list, or another automation like mikes FST TermDictionary)

Another example would be to add some phonetic variations here and there...

"CARIN"   <Regex("C|K")><Lev("ARIN")>

Point being, this enables us to restrict search space of Lev automaton
significantly (which is normally huge as simple transposition needs
minDistance 2 ...) to achieve what we want. (e.g. most of CARIN
matches with LEV(2) are mostly nonsense)

one could also restrict suffixes easily, with supported intersection,
one could do other wild things...

I'll definitely have to look at this "scary" code as Mike indicated in his blog

Cheers, eks



On Wed, Aug 10, 2011 at 11:07 AM, Dawid Weiss
<dawid.weiss@cs.put.poznan.pl> wrote:
>
> The thing you're describing is a regular composition of automata (as it
> exists, for example, when composing clauses of a regular expression). If I
> recall right the Levenshtein automaton in Lucene is built on modified brics
> code... if so then this should be not a problem. The problem may be that
> currently automatons are used in enums in a way that skips from one accepted
> sequence to another accepted sequence (if possible). If the automaton has *
> operators then there is no way to establish these and everything falls back
> to full matching strategy.
> Dawid
>
> On Wed, Aug 10, 2011 at 10:54 AM, eks dev <eksdev@yahoo.co.uk> wrote:
>>
>> Hi Robert, Mike & other FS(A|T) gurus,
>>
>> a challenge for you ;)
>>
>> Would it be possible to combine these brilliant peaces of functionality
>> with normal Automaton somehow...
>>
>> Example to illustrate.
>> DirectSpellChecker:
>> - where instead of minPrefix, we would specify Regex (other Automaton)
>> pfxAutiomaton = Regex("(AB)|(BA)") // e.g. Saying,
>> levAutomaton = LevenshteinAutomata("XYZ")
>>
>> spell(pfxAutomaton, levAutomaton);
>>
>> would match terms that start with "AB" or "BA" and suffix part are normal
>> edit distance matches, like ABXY, with one delete
>> This would support wild things, like "enable only transpositions in first
>> three characters"... In order to gat these matches today, you need to make
>> Lev. Automata with maxDistance = 2 (which is then HUGE space to search
>> without prefix)... Or generate more Lev. automata and make union of results
>> (expensive to itterate)
>>
>> Other good use cases are simple to construct...
>>
>> The most general question, can we support at least concatenation between
>> LevenshteinAutomata  and normal Automata. Intersection/union would be crazy
>> thing as well? Where we would have:
>> FilteringAutomata.intersect(LevenshteinAutomata)... but I guess I am
>> dreaming with this one, but concatenation sounds  doable (at least prefix
>> side)
>>
>> Cheers,
>> Eks
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message