commons-issues mailing list archives

From "Gary D. Gregory (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CODEC-125) Implement a Beider-Morse phonetic matching codec
Date Wed, 27 Jul 2011 18:06:09 GMT

    [ https://issues.apache.org/jira/browse/CODEC-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071893#comment-13071893 ]

Gary D. Gregory commented on CODEC-125:
---------------------------------------

I hope Matthew can fix this, but here is what I have found out so far.

I added:

{code:java}
    @Ignore
    @Test
    public void testLongestEnglishSurname() throws EncoderException {
        BeiderMorseEncoder bmpm = new BeiderMorseEncoder();
        bmpm.setNameType(NameType.GENERIC);
        bmpm.setRuleType(RuleType.APPROX);
        bmpm.encode("MacGhilleseatheanaich");
    }
{code}

The test does indeed take forever (I killed it after a couple of minutes). When I suspend the
test in the debugger, it appears to spend its time in:

PhoneticEngine.normalizeLanguageAttributes(String, boolean) line: 266

{code:java}
    private String normalizeLanguageAttributes(final String input, final boolean strip) {
        String text = input;
        Set<String> langs = new HashSet<String>();

        int bracketStart;
        while ((bracketStart = text.indexOf('[')) != -1) {
            int bracketEnd = text.indexOf(']', bracketStart);
            if (bracketEnd == -1) {
                throw new IllegalArgumentException("no closing square bracket in: " + text);
            }

            String body = text.substring(bracketStart + 1, bracketEnd);
            langs.addAll(Arrays.asList(body.split("[+]")));
            text = text.substring(0, bracketStart) + text.substring(bracketEnd + 1);
        }

        if (langs.isEmpty() || strip) {
            return text;
        } else if (langs.contains(Languages.ANY)) {
            return "[" + Languages.ANY + "]";
        } else {
            return text + "[" + join(langs, "+") + "]";
        }
    }
{code}

The input String is 8,722,727 chars long!

No wonder it takes forever!
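
Part of the cost may come from the loop itself: every bracket group it strips rebuilds the whole remaining text with two substring calls and a concatenation, so on a multi-megabyte input with many bracket groups the work grows roughly quadratically. Below is a minimal sketch of a single-pass variant that appends to a StringBuilder instead. It is not a patch against the codec: the class name NormalizeSketch is hypothetical, the ANY constant is an assumed stand-in for Languages.ANY, String.join (Java 8) replaces the class's own join helper, and LinkedHashSet is used only to keep the output order predictable. It also does not explain why the input grows to 8,722,727 chars in the first place.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class NormalizeSketch {
    // Assumption: stand-in for Languages.ANY; the real constant lives in the codec.
    static final String ANY = "any";

    // Strips every "[...]" group in one left-to-right pass, collecting the
    // '+'-separated language names found inside the brackets.
    static String normalize(final String input, final boolean strip) {
        StringBuilder text = new StringBuilder(input.length());
        Set<String> langs = new LinkedHashSet<String>();

        int pos = 0; // start of the not-yet-copied part of the input
        int bracketStart;
        while ((bracketStart = input.indexOf('[', pos)) != -1) {
            int bracketEnd = input.indexOf(']', bracketStart);
            if (bracketEnd == -1) {
                throw new IllegalArgumentException("no closing square bracket in: " + input);
            }
            // Copy only the text between the previous group and this one.
            text.append(input, pos, bracketStart);
            String body = input.substring(bracketStart + 1, bracketEnd);
            langs.addAll(Arrays.asList(body.split("[+]")));
            pos = bracketEnd + 1;
        }
        text.append(input, pos, input.length()); // tail after the last group

        if (langs.isEmpty() || strip) {
            return text.toString();
        } else if (langs.contains(ANY)) {
            return "[" + ANY + "]";
        } else {
            return text.toString() + "[" + String.join("+", langs) + "]";
        }
    }
}
```

Each character of the input is copied at most once here, so the running time is linear in the input length, whereas the original rebuilds the string once per bracket group.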

Matthew: Can this be?



> Implement a Beider-Morse phonetic matching codec
> ------------------------------------------------
>
>                 Key: CODEC-125
>                 URL: https://issues.apache.org/jira/browse/CODEC-125
>             Project: Commons Codec
>          Issue Type: New Feature
>            Reporter: Matthew Pocock
>            Priority: Minor
>         Attachments: bm-gg.diff, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch
>
>
> I have implemented Beider Morse Phonetic Matching as a codec against the commons-codec svn trunk. I would like to contribute this to commons-codec.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
