commons-issues mailing list archives

From "Gary D. Gregory (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CODEC-125) Implement a Beider-Morse phonetic matching codec
Date Fri, 01 Jul 2011 02:14:28 GMT

    [ https://issues.apache.org/jira/browse/CODEC-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058180#comment-13058180 ]

Gary D. Gregory commented on CODEC-125:
---------------------------------------

Hi Matthew:

Thank you for the 2nd patch.

I just spent an hour trying to figure this out tonight and here is how far I got using the
latest patch dated 29/Jun/11 21:40.

First, a code comment: let's put all the code in the {{language.bm}} package instead of splitting
it up into two packages. I prefer the name {{bm}} for now; the {{pm}} seems redundant under {{language}}.


There seems to be a fundamental problem with the regex strings in this patch.

The first error I ran into was in {{lang.txt}}. The first non-comment line is:

{{^o’ english true}}

which gave a {{PatternSyntaxException}} I can no longer reproduce.

I commented that line out. Then I got a {{PatternSyntaxException}} on lines that start with
{{?}}. This makes sense, since {{?}} is a quantifier and needs something before it to apply to.
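To make the quantifier point concrete, here is a minimal sketch (the class name {{QuantifierCheck}} and the helper {{compiles}} are made up for illustration and are not part of the patch). It only shows that a bare {{?}} is rejected by {{java.util.regex}}, while an escaped or quoted {{?}} compiles as an ordinary literal:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class, for illustration only.
public class QuantifierCheck {

    /** Returns true if the given string compiles as a java.util.regex pattern. */
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A bare quantifier has nothing to apply to, so compilation fails.
        System.out.println(compiles("?"));
        // Escaped or quoted, the same character is an ordinary literal.
        System.out.println(compiles("\\?"));
        System.out.println(compiles(Pattern.quote("?")));
    }
}
```

Whether escaping is the right fix depends on whether the {{?}} in the rule files is meant literally, which I cannot tell from the patch alone.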

So I hacked the code to skip lines that start with {{?}} in Lang.java like so:

{code:java}
            Pattern pattern = null;
            final String regex = parts[0];
            if (regex.charAt(0) != '?') {
                try {
                    pattern = Pattern.compile(regex);
                } catch (PatternSyntaxException e) {
                    throw new IllegalArgumentException(
                            "Error compiling regex at line " + lineNo + ": " + regex, e);
                }
                String[] langs = parts[1].split("\\+");
                boolean accept = parts[2].equals("true");

                rules.add(new LangRule(pattern,
                        new HashSet<String>(Arrays.asList(langs)), accept));
            }
{code}

Next up is Rule.java, which blows up due to the same {{?}} issue:

{noformat}
java.lang.ExceptionInInitializerError
	at org.apache.commons.codec.language.bmpm.PhoneticEngine.phoneticUtf8(PhoneticEngine.java:98)
	at org.apache.commons.codec.language.bmpm.PhoneticEngine.encode(PhoneticEngine.java:85)
	at org.apache.commons.codec.language.PhoneticTest.testPhonetic(PhoneticTest.java:72)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
	at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
	at org.junit.runners.Suite.runChild(Suite.java:128)
	at org.junit.runners.Suite.runChild(Suite.java:24)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
	at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: java.lang.IllegalArgumentException: Error compiling regex: (?|?|?|?|?|?|?|?|?|?)
	at org.apache.commons.codec.language.bmpm.Rule.<init>(Rule.java:46)
	at org.apache.commons.codec.language.bmpm.Rule.parseRules(Rule.java:215)
	at org.apache.commons.codec.language.bmpm.Rule.<clinit>(Rule.java:129)
	... 34 more
Caused by: java.util.regex.PatternSyntaxException: Unknown inline modifier near index 2
(?|?|?|?|?|?|?|?|?|?)$
  ^
	at java.util.regex.Pattern.error(Pattern.java:1713)
	at java.util.regex.Pattern.group0(Pattern.java:2519)
	at java.util.regex.Pattern.sequence(Pattern.java:1806)
	at java.util.regex.Pattern.expr(Pattern.java:1752)
	at java.util.regex.Pattern.compile(Pattern.java:1460)
	at java.util.regex.Pattern.<init>(Pattern.java:1133)
	at java.util.regex.Pattern.compile(Pattern.java:823)
	at org.apache.commons.codec.language.bmpm.Rule.<init>(Rule.java:44)
	... 36 more
{noformat}

I hacked this class as well to surface the underlying RE:

{code:java}
    public Rule(String pattern, String lContext, String rContext, String phoneme,
            Set<String> language, String logical)
    {
        this.pattern = pattern;
        try {
            this.lContext = Pattern.compile(lContext + "$");
        } catch (PatternSyntaxException e) {
            throw new IllegalArgumentException("Error compiling regex: " + lContext, e);
        }
{code}

Then I quit.

I do not understand how this can work for you and not for me.

What happens when you run:

{code:java}
import java.util.regex.Pattern;

public class PatternTest {

    /**
     * @param args
     */
    public static void main(String[] args) {
        System.out.println(Pattern.compile("?"));
    }

}
{code}

I get:

{noformat}
Exception in thread "main" java.util.regex.PatternSyntaxException: Dangling meta character '?' near index 0
?
^
	at java.util.regex.Pattern.error(Pattern.java:1713)
	at java.util.regex.Pattern.sequence(Pattern.java:1878)
	at java.util.regex.Pattern.expr(Pattern.java:1752)
	at java.util.regex.Pattern.compile(Pattern.java:1460)
	at java.util.regex.Pattern.<init>(Pattern.java:1133)
	at java.util.regex.Pattern.compile(Pattern.java:823)
	at PatternTest.main(PatternTest.java:9)
{noformat}

How about you?
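
One possibility that may be worth ruling out, offered as an assumption rather than a diagnosis: if the rule files contain non-ASCII characters and they passed through an ASCII-only encoding step somewhere between your machine and mine, each such character would arrive as a literal {{?}}. A minimal sketch of that effect, using a made-up sample string and a hypothetical class name ({{EncodingCheck}}), neither of which comes from the patch:

```java
import java.nio.charset.Charset;

// Hypothetical demonstration class, for illustration only.
public class EncodingCheck {

    /** Encodes the string as US-ASCII and decodes it again. */
    public static String roundTripAscii(String s) {
        Charset ascii = Charset.forName("US-ASCII");
        // getBytes(Charset) substitutes the charset's replacement byte,
        // which is '?' for US-ASCII, for every unmappable character.
        return new String(s.getBytes(ascii), ascii);
    }

    public static void main(String[] args) {
        // Made-up alternation of accented vowels, written with escapes
        // so this source file itself stays pure ASCII.
        String sample = "(\u00e1|\u00e9|\u00ed)";
        System.out.println(roundTripAscii(sample)); // prints (?|?|?)
    }
}
```

The result has the same shape as the {{(?|?|?|?|?|?|?|?|?|?)}} string in the stack trace above, which is why comparing the file encodings on both sides seems like a cheap check.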

> Implement a Beider-Morse phonetic matching codec
> ------------------------------------------------
>
>                 Key: CODEC-125
>                 URL: https://issues.apache.org/jira/browse/CODEC-125
>             Project: Commons Codec
>          Issue Type: New Feature
>            Reporter: Matthew Pocock
>            Priority: Minor
>         Attachments: bmpm.patch, bmpm.patch
>
>
> I have implemented Beider Morse Phonetic Matching as a codec against the commons-codec svn trunk. I would like to contribute this to commons-codec.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
