lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From eks dev <eks...@yahoo.co.uk>
Subject Re: LevenshteinAutomata challenge
Date Wed, 10 Aug 2011 13:03:10 GMT
Hi Robert, 
indeed, transpositions can be embedded into Lev, Automaton. I share my opinion 
with you, even stronger, I think transpositions are fundamental for spell-check 
type of apps. 


But, even if implemented, it would not change the point from my example, only 
slightly. ("To support efficiently..." )

<Lev("ABXYZ", 1)> // assuming transposition  counts as one, still has much more 
to do ( "bigger search space", or whatever is this called) 

then this composition
<RegEx("(AB)|(BA)")><Lev("XYZ", 1)>


Lev("CARIN", 1) 
vs 
<RegEx("[CK]")><Lev("ARIN", 1)>

And so on and so on...

So yes, transpositions would be great, but they are orthogonal to "regular 
composition of automata"  (Lev. and regex and however one constructs them) . 


This composition of automata enables us to restrict space significantly, and add 
heuristics that help improve match quality (precision).
In my experience, plain edit distance without heuristics is rarely useful, 
especially on shorter than, say 7 chars, It generates a lot of candidates and 
one needs re-scoring phase .... which is in turn slow, if too many candidates. 
So we come back to this, "automata composition" ,  with it, you could push some 
of these "heuristics" back to candidate fetching phase to increase precision of 
this "candidate fetching phase with Edit distance" ... 


Theoretical alternative, less flexible/clean, would be to somehow add parameters 
to Lev. automata builder, like "I want to have max distance = 1 from 3rd 
character on, but only transposition without deletes/inserts on first two 
characters".... looks ugly 


But Mike and Dawid said this is already there, works... only some API sugar is 
needed! 

Lovely :)


Re "... though if you don't want to implement it, you could let him know are 
interested..."

I would really live to, but to find that much time to understand this hairy code 
is at the moment impossible. 


Thanks to all, it turned out that your answer to my challenge was simple,  
"done, we have it already!" ;)



> in order to support transposition in the first two characters like in
> my example, you would need Lev. Automaton that has maxDistance 2,

Actually this isn't true: we can implement the variant where
transpositions are a basic edit operation:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 (Chapter 7)

I think this transpositions variant would really be more ideal for
spellchecking.
I think it would actually be best/easiest if this was implemented in
python in moman: https://bitbucket.org/jpbarrette/moman/

Jean-Philippe Barrette-LaPierre has told me before he was interested
in implementing it, but I think his time is limited, though if you
don't want to implement it, you could let him know are interested.
Mime
View raw message