commons-issues mailing list archives

From "michael tobias (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens
Date Wed, 11 Jun 2014 10:18:02 GMT

    [ https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027597#comment-14027597 ]

michael tobias commented on CODEC-187:
--------------------------------------

I am not aware of any test data set to run comparisons against.  There are, however, some online
tools (written by Steve Morse) where you can generate the tokens for any word, but until yesterday
these tools only produced tokens based on the latest version of the BMPM code and only used
GENERIC, APPROX.  Hence my questions about which version of the algorithm had actually been
coded into the Commons Codec.

Steve has now produced a modified system where you can select any of the code versions and
the types of tokens required.

This can be found at:

http://stevemorse.org/census/soundex.html

While playing with this new tool I realised that the encoding of the sample I was using (abram)
had not changed since 2009 and so the problem in the Commons Codec was likely NOT a version
issue.
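
For anyone wanting to repeat this kind of check, here is a minimal sketch of comparing a token
list from the Codec against one copied from the online tool. It assumes both are available as
'|'-separated strings (which, as far as I know, is how BeiderMorseEncoder.encode returns its
alternatives for a single word). The class name TokenDiff, its methods and the sample strings are
hypothetical placeholders, not real BMPM output.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch only: compares two pipe-separated token lists, e.g. one produced by
// the Commons Codec encoder and one copied from the online reference tool.
class TokenDiff {

    // Split "a|b|c" into a sorted set, ignoring blanks and surrounding spaces.
    static Set<String> parse(String tokens) {
        Set<String> result = new TreeSet<>();
        for (String t : tokens.split("\\|")) {
            String trimmed = t.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    // Tokens present in 'reference' but absent from 'actual'.
    static Set<String> missing(String reference, String actual) {
        Set<String> diff = parse(reference);
        diff.removeAll(parse(actual));
        return diff;
    }

    public static void main(String[] args) {
        // Placeholder strings, NOT real BMPM tokens.
        String fromReferenceTool = "tokA|tokB|tokC";
        String fromCodec = "tokB|tokC|tokD";

        System.out.println("missing from codec: " + missing(fromReferenceTool, fromCodec));
        System.out.println("extra in codec:     " + missing(fromCodec, fromReferenceTool));
    }
}
```

Comparing as sets rather than as raw strings means a difference in token order alone is not
reported as a mismatch.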

I don't think there is a change log between versions (I am chasing Steve on this), but
I suspect almost all changes made since 2009 affect only a few specific words/letter combinations,
though there might be more changes for specific languages.

A modified version of this message can be posted to dev if you think it worthwhile.


> Beider Morse Phonetic Matching producing incorrect tokens
> ---------------------------------------------------------
>
>                 Key: CODEC-187
>                 URL: https://issues.apache.org/jira/browse/CODEC-187
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.9
>            Reporter: michael tobias
>            Priority: Minor
>
> I believe the Beider Morse Phonetic Matching algorithm was added in Commons Codec 1.6.
> The BMPM algorithm is an EVOLVING algorithm that is currently on version 3.02, though
> it had been static since version 3.01 dated 19 Dec 2011 (it was first available as open source
> as version 1.00 on 6 May 2009).
> I can see nothing in the Commons Codec Docs to say which version of BMPM was implemented,
> so I am not sure if the problem with the algorithm as coded in the Codec is simply an old
> version or whether there are more basic problems with the implementation.
> How do I determine the version of the algorithm that was implemented in the Commons Codec?
> How do we ensure that the algorithm is updated if/when the BMPM algorithm changes?
> How do we ensure that the algorithm as coded in the Commons Codec is accurate and working
> as expected?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
