Mailing-List: contact issues-help@commons.apache.org; run by ezmlm
Precedence: bulk
Reply-To: issues@commons.apache.org
Date: Fri, 18 Jul 2014 07:33:07 +0000 (UTC)
From: "michael tobias (JIRA)" <jira@apache.org>
To: issues@commons.apache.org
Message-ID: <JIRA.12720471.1402440563128.69890.1405668787046@arcas>
In-Reply-To: <JIRA.12720471.1402440563128@arcas>
References: <JIRA.12720471.1402440563128@arcas>
Subject: [jira] [Commented] (CODEC-187) Beider Morse Phonetic Matching
 producing incorrect tokens
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066138#comment-14066138 ] 

michael tobias commented on CODEC-187:
--------------------------------------

Sorry, I have not had much time to do more testing on this.  As I said previously, all known bugs have been sorted in the latest build, but that doesn't mean there are none left.

I have been doing my testing on a genealogical database of 4 million records which have been indexed in Solr and also in another system with BMPM implemented.  The number of matches returned by searches in the two systems is near-identical thus far, but I have not drilled down to compare the actual BMPM tokens.  However, even if the lists of tokens generated are not a perfect match, the results of the two sets of searches suggest that there is sufficient overlap in all cases for the results to be broadly equivalent.
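
For anyone who wants to drill down to the token level, a minimal sketch of such a comparison follows. It assumes each system can dump its BMPM output for a name as a flat pipe-delimited string (e.g. "aaa|bbb|ccc"); the class name TokenOverlap, its methods, and the sample token strings are all hypothetical and made up for illustration, not real output from either system.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: compares two pipe-delimited token lists for the same name.
public class TokenOverlap {

    // Split a pipe-delimited token string into a set, skipping empty entries.
    static Set<String> parse(String encoded) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String t : encoded.split("\\|")) {
            String trimmed = t.trim();
            if (!trimmed.isEmpty()) {
                tokens.add(trimmed);
            }
        }
        return tokens;
    }

    // Jaccard overlap: |A intersect B| / |A union B|; two empty lists count as identical.
    static double jaccard(String a, String b) {
        Set<String> sa = parse(a);
        Set<String> sb = parse(b);
        Set<String> union = new LinkedHashSet<>(sa);
        union.addAll(sb);
        if (union.isEmpty()) {
            return 1.0;
        }
        Set<String> common = new LinkedHashSet<>(sa);
        common.retainAll(sb);
        return (double) common.size() / union.size();
    }

    // Tokens present in the first list but missing from the second.
    static Set<String> onlyIn(String a, String b) {
        Set<String> diff = parse(a);
        diff.removeAll(parse(b));
        return diff;
    }

    public static void main(String[] args) {
        // Made-up token strings standing in for the two systems' output.
        String fromSolr = "tobias|tobis|tubias";
        String fromOther = "tobias|tubias|dobias";

        System.out.println("overlap    = " + jaccard(fromSolr, fromOther));
        System.out.println("only Solr  = " + onlyIn(fromSolr, fromOther));
        System.out.println("only other = " + onlyIn(fromOther, fromSolr));
    }
}
```

Running this over a sample of surnames and listing the names with the lowest overlap would show whether the near-identical match counts hide real token differences. Multi-word names, which may be encoded with grouping rather than as a flat list, would need extra parsing that this sketch does not attempt.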

I would like to do more testing, but I am really busy preparing for a conference and will be out of action for several weeks, so I can do no more at this time.

Regards

M

> Beider Morse Phonetic Matching producing incorrect tokens
> ---------------------------------------------------------
>
>                 Key: CODEC-187
>                 URL: https://issues.apache.org/jira/browse/CODEC-187
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.9
>            Reporter: michael tobias
>            Priority: Minor
>             Fix For: 1.10
>
>         Attachments: CODEC-187.patch, CODEC-187_ashkenazi_approx_any.patch, CODEC-187_ashkenazi_approx_any_v2.patch, CODEC_187_sync_with_v3.3.diff
>
>
> I believe the Beider Morse Phonetic Matching algorithm was added in Commons Codec 1.6.
> The BMPM algorithm is an EVOLVING algorithm that is currently on version 3.02, though it had been static since version 3.01 dated 19 Dec 2011 (it was first available as open source as version 1.00 on 6 May 2009).
> I can see nothing in the Commons Codec docs to say which version of BMPM was implemented, so I am not sure whether the problem with the algorithm as coded in the Codec is simply that it is an old version or whether there are more basic problems with the implementation.
> How do I determine the version of the algorithm that was implemented in the Commons Codec?
> How do we ensure that the algorithm is updated if/when the BMPM algorithm changes?
> How do we ensure that the algorithm as coded in the Commons Codec is accurate and working as expected?

