MIME-Version: 1.0
In-Reply-To: <03fa01cf8543$f54c11e0$dfe435a0$@tobias.org.uk>
References: <03fa01cf8543$f54c11e0$dfe435a0$@tobias.org.uk>
Date: Wed, 11 Jun 2014 13:56:15 +0200
Subject: Re: [CODEC] Beider Morse Phonetic Matching Bug and questions
From: Thomas Neidhart
To: Commons Developers List
Content-Type: multipart/alternative; boundary=001a11c2f47cac3af604fb8e210a

--001a11c2f47cac3af604fb8e210a
Content-Type: text/plain; charset=UTF-8

Hi,

as already commented on https://issues.apache.org/jira/browse/CODEC-187, the
problem is related to some wrongly ported rule files from the original source.

This, otoh, creates a serious problem for us, as it looks like the
Beider-Morse phonetic matching encoder in commons-codec is a derived work of
a PHP codebase released under the GPLv3 licence. The original codebase is
available at http://stevemorse.org/phoneticinfo.htm.

While investigating the bug and comparing our rule files with the ones from
the original codebase, it is quite clear that at least these are identical.

The author of the patch (see https://issues.apache.org/jira/browse/CODEC-125)
ported the code and applied the Apache license, but the license of the
original codebase was never considered or discussed.

This is quite serious I guess, as we have already released the code.

We can ask the original authors to re-license their code to the Apache
Software Foundation under a compatible license, but I wonder if they are
willing to do so. This encoder is also used a lot in Lucene/Solr, so it might
have even larger implications.

Any ideas how to proceed, or whether a re-licensing would be sufficient in
this case?

Thomas


On Wed, Jun 11, 2014 at 9:08 AM, Michael Tobias wrote:

> Does anybody have a working knowledge of the coding of the Beider Morse
> Phonetic Matching in the Apache Commons Codec?
>
> My recent tests using Solr suggest there is a discrepancy between Steve
> Morse and Alexander Beider's algorithm and the algorithm currently live in
> Solr (and hence the Commons Codec).
>
> I know that the source code for BMPM issued by Steve has changed several
> times over the years, and I thought at first it might be that the version
> used in the Commons Codec is an old version that has subsequently been
> overtaken. Should the version of the BMPM algorithm not be listed in the
> Commons Codec documentation? How should version changes to the algorithm be
> implemented? The algorithm is quite static now so this is probably not so
> important now but surely it should be DOCUMENTED???
>
> My tests now indicate that the discrepancies are NOT a version problem as
> testing against a very old version 2.00 of the BMPM source code issued on
> 18 June 2009 still exhibits the same problem.
>
> Using just a single test term the results are not good. The only saving
> grace is that the most widely used version is
>
> nameType="GENERIC" ruleType="APPROX"
>
> and that is a close (but not perfect) match at least for this ONE test
> word.
>
> For the name Abram, all with languageSet="auto"
>
> GENERIC APPROX - fails - misses a few tokens
> Should create tokens: abram abrom avram avrom obram obrom ovram ovrom abran
> abron obran obron Ybram Ybrom
> Solr creates: abram abrom avram avrom obram obrom ovram ovrom abran abron
> obran obron
>
> GENERIC EXACT - good!
> Should create tokens: abram abran
> Solr creates: abram abran
>
> ASHKENAZI APPROX: - fails dreadfully!
> Should create tokens: abram abrom avram avrom obram obrom ovram ovrom Ybram
> Ybrom ombram ombrom imbram imbrom
> Solr creates: abrAm AvrAm BbrAm
>
> ASHKENAZI EXACT: - good!
> Should create tokens: abram
> Solr creates: abram
>
> SEPHARDIC APPROX: - good!
> Should create tokens: abram bram abran bran avram vram
> Solr creates: abram bram abran bran avram vram
>
> SEPHARDIC EXACT: - good!
> Should create tokens: abram abran avram
> Solr creates: abram abran avram
>
> I would appreciate it if somebody with knowledge of the programming of this
> functionality could investigate.
>
> For the worst case I attach here a debug trace of the calculation of the
> Ashkenazi Approx tokens straight from Steve Morse' implementation. It looks
> like some of the final rules are not being implemented properly, or at all.
> The language codes in parenthesis vary from BMPM version to version but the
> resulting tokens have not changed from version 2.00 up to the current 3.02
>
> Thanks
>
> Michael
>
>
> applying language rules from (rulesany) to abram using languages 2012
> char codes = [#61]a [#62]b [#72]r [#61]a [#6d]m
>
> applying rule #225
> pattern=a
> lcontext=
> rcontext=[bcdgkpstwzż]
> subst=(A|B[128])
> result=(A[2012]|B[128])
>
> applying rule #229
> pattern=b
> lcontext=
> rcontext=
> subst=b
> result=(Ab[2012]|Bb[128])
>
> applying rule #245
> pattern=r
> lcontext=
> rcontext=
> subst=r
> result=(Abr[2012]|Bbr[128])
>
> applying rule #228
> pattern=a
> lcontext=
> rcontext=
> subst=A
> result=(AbrA[2012]|BbrA[128])
>
> applying rule #240
> pattern=m
> lcontext=
> rcontext=
> subst=m
> result=(AbrAm[2012]|BbrAm[128])
>
> after language rules: (AbrAm[2012]|BbrAm[128])
>
>
> applying final rules from (exactapproxcommon plus approxcommon) to AbrAm[2012]
> no rules match for phonetic item 0 at position 0: A
> no rules match for phonetic item 0 at position 1: Ab
> no rules match for phonetic item 0 at position 2: Abr
> no rules match for phonetic item 0 at position 3: AbrA
> no rules match for phonetic item 0 at position 4: AbrAm
>
> applying final rules from (exactapproxcommon plus approxcommon) to BbrAm[128]
> no rules match for phonetic item 1 at position 0: B
> no rules match for phonetic item 1 at position 1: Bb
> no rules match for phonetic item 1 at position 2: Bbr
> no rules match for phonetic item 1 at position 3: BbrA
> no rules match for phonetic item 1 at position 4: BbrAm
>
> applying final rules from (approxany) to AbrAm[2012]
> after applying final rule #97 to phonetic item #0 at position 0:
> (a[2012]|o[2012]|Y[16]) pattern=A lcontext= rcontext= subst=(a|o|Y[16])
> after applying final rule #0 to phonetic item #0 at position 1:
> (ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16]) pattern=b lcontext= rcontext= subst=(b|v[1024])
> no rules match for phonetic item 0 at position 2:
> (ab[2012]|av[1024]|ob[2012]|ov[1024]|Yb[16])r
> after applying final rule #93 to phonetic item #0 at position 3:
> (abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|Ybra[16]|Ybro[16]) pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
> no rules match for phonetic item 0 at position 4:
> (abra[2012]|abro[2012]|avra[1024]|avro[1024]|obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|Ybra[16]|Ybro[16])m
>
> applying final rules from (approxany) to BbrAm[128]
> after applying final rule #22 to phonetic item #1 at position 0:
> (o[2012]|om[128]|im[128]) pattern=B lcontext= rcontext=[bp] subst=(o|om[128]|im[128])
> after applying final rule #0 to phonetic item #1 at position 1:
> (ob[2012]|ov[1024]|omb[128]|imb[128]) pattern=b lcontext= rcontext= subst=(b|v[1024])
> no rules match for phonetic item 1 at position 2:
> (ob[2012]|ov[1024]|omb[128]|imb[128])r
> after applying final rule #93 to phonetic item #1 at position 3:
> (obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128]|imbro[128]) pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
> no rules match for phonetic item 1 at position 4:
> (obra[2012]|obro[2012]|ovra[1024]|ovro[1024]|ombra[128]|ombro[128]|imbra[128]|imbro[128])m
>
>
> resulting tokens:
>
> (abram|abrom|avram|avrom|obram|obrom|ovram|ovrom|Ybram|Ybrom|ombram|ombrom|imbram|imbrom)

--001a11c2f47cac3af604fb8e210a--
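
The expected-versus-actual comparison in the report can be checked mechanically. What follows is only a minimal sketch in Python, not part of Commons Codec, Solr, or the original BMPM code. The token lists are copied from the message above; the names `CASES`, `compare_tokens`, and `summarize` are hypothetical and exist only for this illustration. The comparison is assumed to be case-sensitive, so `abrAm` does not count as `abram`.

```python
from __future__ import annotations

# Token lists copied from the report for the name "Abram".
# Each entry is (tokens the reference implementation should create,
#                tokens Solr actually created).
CASES = {
    "GENERIC APPROX": (
        "abram abrom avram avrom obram obrom ovram ovrom "
        "abran abron obran obron Ybram Ybrom",
        "abram abrom avram avrom obram obrom ovram ovrom "
        "abran abron obran obron",
    ),
    "GENERIC EXACT": ("abram abran", "abram abran"),
    "ASHKENAZI APPROX": (
        "abram abrom avram avrom obram obrom ovram ovrom "
        "Ybram Ybrom ombram ombrom imbram imbrom",
        "abrAm AvrAm BbrAm",
    ),
    "ASHKENAZI EXACT": ("abram", "abram"),
    "SEPHARDIC APPROX": (
        "abram bram abran bran avram vram",
        "abram bram abran bran avram vram",
    ),
    "SEPHARDIC EXACT": ("abram abran avram", "abram abran avram"),
}


def compare_tokens(expected: str, actual: str) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) tokens, in order, compared case-sensitively."""
    exp, act = expected.split(), actual.split()
    missing = [t for t in exp if t not in act]      # expected but not produced
    unexpected = [t for t in act if t not in exp]   # produced but not expected
    return missing, unexpected


def summarize(cases: dict[str, tuple[str, str]] = CASES) -> dict[str, str]:
    """Label each configuration 'good' if both token sets agree, else 'fails'."""
    verdicts = {}
    for name, (expected, actual) in cases.items():
        missing, unexpected = compare_tokens(expected, actual)
        verdicts[name] = "good" if not missing and not unexpected else "fails"
    return verdicts


if __name__ == "__main__":
    for name, (expected, actual) in CASES.items():
        missing, unexpected = compare_tokens(expected, actual)
        print(f"{name}: missing={missing} unexpected={unexpected}")
```

For ASHKENAZI APPROX the three unexpected tokens (`abrAm`, `AvrAm`, `BbrAm`) are exactly the intermediate phonetic forms the trace shows before the `approxany` final rules run. That fits the suspicion in the report that those final rules are not being applied.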