commons-issues mailing list archives

From "Thomas Neidhart (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens
Date Fri, 13 Jun 2014 21:27:03 GMT

    [ https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031191#comment-14031191 ]

Thomas Neidhart commented on CODEC-187:
---------------------------------------

This needs further analysis, but I doubt that the algorithm / code itself has changed at all.
The Beider Morse phonetic encoder is a generic rule-based replacement algorithm with domain-specific
rules.

That said, I could imagine that we add different versions of the rules and allow the
user to create instances of the BeiderMorseEncoder using a chosen ruleset. Just keep in
mind that the current ruleset is approx. 548 kB uncompressed and ~115 kB compressed, so
adding multiple versions would further increase the size of the jar file.
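
To make the idea of selectable rulesets concrete, here is a minimal, self-contained sketch. It is not the Commons Codec API and not the real BMPM rules: the classes Ruleset and ToyEncoder, the version labels "toy-1" and "toy-2", and the three replacement rules are all hypothetical and exist only to show a generic replacement engine whose behaviour depends entirely on the ruleset it is given.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class VersionedRulesSketch {

    /** A toy ruleset: ordered literal replacements plus a version label (hypothetical). */
    static final class Ruleset {
        private final String version;
        private final Map<String, String> rules = new LinkedHashMap<>();

        Ruleset(String version) {
            this.version = version;
        }

        /** Adds one replacement rule; rules are applied in insertion order. */
        Ruleset rule(String pattern, String replacement) {
            rules.put(pattern, replacement);
            return this;
        }

        String version() {
            return version;
        }

        /** Applies every rule in order to the whole input string. */
        String apply(String input) {
            String out = input;
            for (Map.Entry<String, String> e : rules.entrySet()) {
                out = out.replace(e.getKey(), e.getValue());
            }
            return out;
        }
    }

    /** The generic engine: it knows nothing about names, only how to apply a ruleset. */
    static final class ToyEncoder {
        private final Ruleset ruleset;

        ToyEncoder(Ruleset ruleset) {
            this.ruleset = ruleset;
        }

        String encode(String name) {
            return ruleset.apply(name.toLowerCase(Locale.ROOT));
        }

        String rulesetVersion() {
            return ruleset.version();
        }
    }

    // Two made-up ruleset versions; the second only adds one rule.
    static Ruleset hypotheticalV1() {
        return new Ruleset("toy-1").rule("sch", "S").rule("ph", "f");
    }

    static Ruleset hypotheticalV2() {
        return new Ruleset("toy-2").rule("sch", "S").rule("ph", "f").rule("tz", "ts");
    }

    public static void main(String[] args) {
        ToyEncoder oldEncoder = new ToyEncoder(hypotheticalV1());
        ToyEncoder newEncoder = new ToyEncoder(hypotheticalV2());
        System.out.println(oldEncoder.rulesetVersion() + ": " + oldEncoder.encode("Schwartz"));
        System.out.println(newEncoder.rulesetVersion() + ": " + newEncoder.encode("Schwartz"));
    }
}
```

The engine code is identical for both instances; only the data differs, which is also why each bundled ruleset version would add to the jar size.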

Furthermore, if updates to the rules only result in additional tokens being returned, no re-indexing
would be necessary imho (though re-indexing might give better results).
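
The re-indexing argument can be checked mechanically: if every token produced by the old rules is still produced by the new rules, values indexed with the old rules remain findable. The sketch below assumes the encoder output is a string of pipe-separated tokens; the token values and the helper names tokens and oldTokensStillMatch are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class ReindexCheckSketch {

    /** Splits a pipe-separated token string (assumed format) into a set of tokens. */
    static Set<String> tokens(String pipeSeparated) {
        return new LinkedHashSet<>(Arrays.asList(pipeSeparated.split("\\|")));
    }

    /** True if the new output contains every old token, i.e. the update only added tokens. */
    static boolean oldTokensStillMatch(Set<String> oldTokens, Set<String> newTokens) {
        return newTokens.containsAll(oldTokens);
    }

    public static void main(String[] args) {
        // Made-up outputs for the same input under an old and a new ruleset.
        Set<String> oldOut = tokens("svarts|svorts");
        Set<String> newOut = tokens("svarts|svorts|zvarts");

        System.out.println("Update only added tokens: " + oldTokensStillMatch(oldOut, newOut));
        System.out.println("Reverse direction holds:  " + oldTokensStillMatch(newOut, oldOut));
    }
}
```

If the check fails for some inputs, the update removed or changed tokens for them, and existing index entries for those inputs may no longer match.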

> Beider Morse Phonetic Matching producing incorrect tokens
> ---------------------------------------------------------
>
>                 Key: CODEC-187
>                 URL: https://issues.apache.org/jira/browse/CODEC-187
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.9
>            Reporter: michael tobias
>            Priority: Minor
>             Fix For: 1.10
>
>         Attachments: CODEC-187.patch
>
>
> I believe the Beider Morse Phonetic Matching algorithm was added in Commons Codec 1.6.
> The BMPM algorithm is an EVOLVING algorithm that is currently on version 3.02 though
> it had been static since version 3.01 dated 19 Dec 2011 (it was first available as open source
> as version 1.00 on 6 May 2009).
> I can see nothing in the Commons Codec Docs to say which version of BMPM was implemented
> so I am not sure if the problem with the algorithm as coded in the Codec is simply an old
> version or whether there are more basic problems with the implementation.
> How do I determine the version of the algorithm that was implemented in the Commons Codec?
> How do we ensure that the algorithm is updated if/when the BMPM algorithm changes?
> How do we ensure that the algorithm as coded in the Commons Codec is accurate and working
> as expected?



