Return-Path: X-Original-To: apmail-lucene-dev-archive@www.apache.org Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id BBCC28BA0 for ; Wed, 10 Aug 2011 13:06:33 +0000 (UTC) Received: (qmail 50216 invoked by uid 500); 10 Aug 2011 13:06:32 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 49955 invoked by uid 500); 10 Aug 2011 13:06:31 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 49948 invoked by uid 99); 10 Aug 2011 13:06:31 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 10 Aug 2011 13:06:31 +0000 X-ASF-Spam-Status: No, hits=2.9 required=5.0 tests=FREEMAIL_FROM,HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_NEUTRAL,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: neutral (nike.apache.org: local policy) Received: from [217.146.182.250] (HELO nm9.bullet.mail.ukl.yahoo.com) (217.146.182.250) by apache.org (qpsmtpd/0.29) with SMTP; Wed, 10 Aug 2011 13:06:23 +0000 Received: from [217.146.183.209] by nm9.bullet.mail.ukl.yahoo.com with NNFMP; 10 Aug 2011 13:03:10 -0000 Received: from [217.146.183.34] by tm2.bullet.mail.ukl.yahoo.com with NNFMP; 10 Aug 2011 13:03:10 -0000 Received: from [127.0.0.1] by omp1023.mail.ukl.yahoo.com with NNFMP; 10 Aug 2011 13:03:10 -0000 X-Yahoo-Newman-Property: ymail-3 X-Yahoo-Newman-Id: 762766.13226.bm@omp1023.mail.ukl.yahoo.com Received: (qmail 91056 invoked by uid 60001); 10 Aug 2011 13:03:10 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.co.uk; s=s1024; t=1312981390; bh=YP6WoTpORwjSsLkq6UBJYl2l3k4o/jxlthOyGkUqC1c=; h=X-YMail-OSG:Received:X-Mailer:Message-ID:Date:From:Subject:To:MIME-Version:Content-Type; b=uSDzmF77fdNovG77ZPmmX/6FOCh1iOSbquxxf281SKT59FZbXBfiOMpvI/Zs4svurvKfJeUswAF+dWg8CQ1SfTVOyjw6iu+1yWO/Thwcjf9ZP47sVrQTOKEABvdGBI/z2OShtfqbVQc0X+u96vltZs/M9XGzsTcn81rTeZcqalg= DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.co.uk; h=X-YMail-OSG:Received:X-Mailer:Message-ID:Date:From:Subject:To:MIME-Version:Content-Type; b=IH24UUktaga3UwqL6MaBR8mUdDf5qrLeqNABrp2+dPYQtBV87Vz44cPDl0ABbn7OYCig9I1zxYnzi2VUiGXJvbBeTIyYed4IuHT2eSA7xqY2cRPuo/2dUIvsZ1zQmZYWm3+wLv0xAm4ZhTyJFrEfRmu0tGepizdB/L4asDNJr40=; X-YMail-OSG: fL0oS_8VM1m5ko3lTzC6xhPqnsp1d0gE..eji6r1PgNlYfM znkWUgsx4djGJAY5L.f3AhsQ7V5Hluo9KJoUS7jHNduA4_BQnCFozLEl9sBz 5hiW1ib7uOJxL_RrhGgKFitwudS7Q2MnNzA7U7AmvgnAcTUnNWKOUnGgL.P2 aJVLWKLyM5YciYPK9EQCOSrAvY9BbPJ_fYeUa6tncQc5gx47nQqYOoRHSbrU R3VZKh8SCJ9YjD9Pjagdgd.wT8bCRxON0dAzFo2snYcLMf9k9NCymgD2T8Gl ArY864HvhmhNahUY3kxWOhEV8PDMtcUfJexcIwX9KActQae31TgSR.yOPCfN s.ra8aSN_Wg_0hMtpCpIVtULVL1o08aT3IuuFRMOCv5NdLp7q3ln2t02LB5D GlGH2pwflEMPrPg33mG8SO.OQTCB4jkr7x6s2PPJk_iLwRVxOIbtBAdpD5f4 vgwUe_6xeZ52L8IKo0JveVqQFBm5Vp1npnsft_S9kwokFMKGQAAzldwZGm63 3NkyuiaM5A0sF Received: from [80.149.113.162] by web27107.mail.ukl.yahoo.com via HTTP; Wed, 10 Aug 2011 14:03:10 BST X-Mailer: YahooMailRC/574 YahooMailWebService/0.8.113.313619 Message-ID: <1312981390.85930.YahooMailRC@web27107.mail.ukl.yahoo.com> Date: Wed, 10 Aug 2011 14:03:10 +0100 (BST) From: eks dev Subject: Re: LevenshteinAutomata challenge To: dev@lucene.apache.org MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="0-132899840-1312981390=:85930" X-Virus-Checked: Checked by ClamAV on apache.org --0-132899840-1312981390=:85930 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Hi Robert, =0Aindeed, transpositions can be embedded into Lev, Automaton. I= share my opinion =0Awith you, even stronger, I think transpositions are fu= ndamental for spell-check =0Atype of apps. =0A=0A=0ABut, even if implemente= d, it would not change the point from my example, only =0Aslightly. ("To su= pport efficiently..." )=0A=0A // assuming transposition c= ounts as one, still has much more =0Ato do ( "bigger search space", or what= ever is this called) =0A=0Athen this composition=0A=0A=0A=0ALev("CARIN", 1) =0Avs =0A=0A=0AAnd so on and so on...=0A=0ASo yes, transpositions would be great, b= ut they are orthogonal to "regular =0Acomposition of automata" (Lev. and r= egex and however one constructs them) . =0A=0A=0AThis composition of automa= ta enables us to restrict space significantly, and add =0Aheuristics that h= elp improve match quality (precision).=0AIn my experience, plain edit dista= nce without heuristics is rarely useful, =0Aespecially on shorter than, say= 7 chars, It generates a lot of candidates and =0Aone needs re-scoring phas= e .... which is in turn slow, if too many candidates. =0ASo we come back to= this, "automata composition" , with it, you could push some =0Aof these "= heuristics" back to candidate fetching phase to increase precision of =0Ath= is "candidate fetching phase with Edit distance" ... =0A=0A=0ATheoretical a= lternative, less flexible/clean, would be to somehow add parameters =0Ato L= ev. automata builder, like "I want to have max distance =3D 1 from 3rd =0Ac= haracter on, but only transposition without deletes/inserts on first two = =0Acharacters".... looks ugly =0A=0A=0ABut Mike and Dawid said this is alre= ady there, works... only some API sugar is =0Aneeded! =0A=0ALovely :)=0A=0A= =0ARe "... though if you don't want to implement it, you could let him know= are =0Ainterested..."=0A=0AI would really live to, but to find that much t= ime to understand this hairy code =0Ais at the moment impossible. =0A=0A=0A= Thanks to all, it turned out that your answer to my challenge was simple, = =0A"done, we have it already!" ;)=0A=0A=0A=0A> in order to support transpos= ition in the first two characters like in=0A> my example, you would need Le= v. Automaton that has maxDistance 2,=0A=0AActually this isn't true: we can = implement the variant where=0Atranspositions are a basic edit operation:=0A= http://citeseerx.ist.psu.edu/viewdoc/summary?doi=3D10.1.1.16.652 (Chapter 7= )=0A=0AI think this transpositions variant would really be more ideal for= =0Aspellchecking.=0AI think it would actually be best/easiest if this was i= mplemented in=0Apython in moman: https://bitbucket.org/jpbarrette/moman/=0A= =0AJean-Philippe Barrette-LaPierre has told me before he was interested=0Ai= n implementing it, but I think his time is limited, though if you=0Adon't w= ant to implement it, you could let him know are interested. --0-132899840-1312981390=:85930 Content-Type: text/html; charset=utf-8 Content-Transfer-Encoding: quoted-printable
Hi Robert,
indeed, transpositions can be embedde= d into Lev, Automaton. I share my opinion with you, even stronger, I think = transpositions are fundamental for spell-check type of apps.

But, e= ven if implemented, it would not change the point from my example, only sli= ghtly. ("To support efficiently..." )

<Lev("ABXYZ", 1)> // ass= uming transposition  counts as one, still has much more to do ( "bigge= r search space", or whatever is this called)
then this composition
&= lt;RegEx("(AB)|(BA)")><Lev("XYZ", 1)>


Lev("CARIN", 1) <= br>vs
<RegEx("[CK]")><Lev("ARIN", 1)>

And so on and = so on...

So yes, transpositions would be great, but they are orthogo= nal to "regular composition of automata"  (Lev. and regex and however one constructs them) .

This composition of automata enable= s us to restrict space significantly, and add heuristics that help improve = match quality (precision).
In my experience, plain edit distance without= heuristics is rarely useful, especially on shorter than, say 7 chars, It g= enerates a lot of candidates and one needs re-scoring phase .... which is i= n turn slow, if too many candidates. So we come back to this, "automata com= position" ,  with it, you could push some of these "heuristics" back t= o candidate fetching phase to increase precision of this "candidate fetchin= g phase with Edit distance" ...

Theoretical alternative, less flexi= ble/clean, would be to somehow add parameters to Lev. automata builder, lik= e "I want to have max distance =3D 1 from 3rd character on, but only transp= osition without deletes/inserts on first two characters".... looks ugly
But Mike and Dawid said this is already there, works... only some API sugar is needed!
Lovely :)


Re "... though if you don't = want to implement it, you could let him know are interested..."

I wo= uld really live to, but to find that much time to understand this hairy cod= e is at the moment impossible.

Thanks to all, it turned out that yo= ur answer to my challenge was simple,  "done, we have it already!" ;)<= br>


> in order to support transposition in the first two char= acters like in
=0A> my example, you would need Lev. Automaton that ha= s maxDistance 2,
=0A
=0AActually this isn't true: we can implement th= e variant where
=0Atranspositions are a basic edit operation:
=0Ahttp://citeseerx.ist.psu.edu/viewdoc/summary?doi=3D10.= 1.1.16.652 (Chapter 7)
=0A
=0AI think this transpositions va= riant would really be more ideal for
=0Aspellchecking.
=0AI think it = would actually be best/easiest if this was implemented in
=0Apython in m= oman: https://bitbucket.org/jpbarrette/moman/
=0A
=0AJean-Philipp= e Barrette-LaPierre has told me before he was interested
=0Ain implement= ing it, but I think his time is limited, though if you
=0Adon't want to = implement it, you could let him know are interested.
=0A=0A=0A=0A --0-132899840-1312981390=:85930--