commons-dev mailing list archives

From Gary Gregory <GGreg...@seagullsoftware.com>
Subject FW: [codec] Testing Cologne Phonetic
Date Tue, 22 Feb 2011 19:44:45 GMT
For the record...

Gary Gregory
Senior Software Engineer
Rocket Software
3340 Peachtree Road, Suite 820 . Atlanta, GA 30326 . USA
Tel: +1.404.760.1560
Email: ggregory@seagullsoftware.com
Web: seagull.rocketsoftware.com



> -----Original Message-----
> From: F Mue [mailto:webmaster@genealogie-konzepte.net]
> Sent: Tuesday, February 22, 2011 13:58
> To: Gary Gregory
> Subject: Re: [codec] Testing Cologne Phonetic
> 
> Hi Gary,
> 
> here is my understanding of how the algorithm in Wikipedia applies to the
> word "deutsch":
> 
> Step 1: Letter-by-letter coding
>     d -> 2
>     e -> 0
>     u -> 0
>     t -> 8 (D, T: before C, S, Z)
>     s -> 8
>     c -> 8 (C: after S, Z)
>     h -> -
> 
> Step 2: Removing multiple, consecutive digits
>     200888 -> 208
> 
> Step 3: Removing all "0" digits except the leading one
>     208 -> 28
> 
> So in my opinion the result of the Apache implementation is correct, and
> the PHP result is wrong.
> 
> Well, the result shows me that I can't trust either PHP
> implementation (neither magdev.de nor the implementation I am using in
> my family name dataset, which also produces the result 288). This means
> I either have to rewrite the PHP implementation or wait for a reliable
> Apache Commons implementation :-)
> 
> 
> Franz
> 
> 
> 
> Am 22.02.2011 18:43, schrieb Gary Gregory:
> >> -----Original Message-----
> >> From: F Mue [mailto:webmaster@genealogie-konzepte.net]
> >> Sent: Tuesday, February 22, 2011 10:37
> >> To: Gary Gregory
> >> Subject: Re: [codec] Testing Cologne Phonetic
> >>
> >> Hi Gary,
> >>
> >> I don't think a re-write or modification would be a big issue. Of course
> >> I would reuse the skeleton of the old code. The major part is about
> >> going through the algorithm and figuring out what rules to apply in what
> >> order. Maybe I have enough time left in March to do that.
> >>
> >> The real problem, in my opinion, is probably how to make sure the code is
> >> correct, i.e. finding valid test data (including test results) ... the
> >> same problem you have :-)
> > Hi Franz,
> >
> > Yes, that's the problem, finding baseline data!
> >
> > For other encoders, I thought about using a database's SOUNDEX function
> > (for example) to generate some data for comparison. But I do not think any
> > DBs implement the Cologne Phonetic algorithm.
> >
> >> It might be easier for me to try the implementation in
> >>     http://www.magdev.de/text_colognephonetic/
> >> But it's alpha code. I can't be sure it's producing correct codes. Well,
> >> I could try to compare the results of that implementation to the codes
> >> in the current release 0.11.2 of my family names data...
> > This unit test data works for us except for "deutsch", where we get 28
> > but the PHP unit test expects 288.
> >
> > It looks like a bug in our code. From my reading of the Wikipedia table,
> > the code should indeed be 288. Can you confirm that please?
> >
> > Thank you again,
> > Gary
> >
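
For reference, the three steps Franz applies to "deutsch" above can be sketched in Python as follows. This is only a minimal sketch based on my reading of the Wikipedia rule table mentioned in the thread, not the Apache Commons Codec implementation. The names cologne_phonetic and _code are hypothetical, and the umlaut folding is an assumption that the three steps above do not cover.

```python
VOWELS = frozenset("AEIJOUY")
C_INITIAL_HARD = frozenset("AHKLOQRUX")  # initial C before these -> 4
C_INNER_HARD = frozenset("AHKOQUX")      # non-initial C before these -> 4
SIBILANTS = frozenset("SZ")

# Assumption: fold umlauts to their base vowels before coding.
UMLAUTS = str.maketrans({"Ä": "A", "Ö": "O", "Ü": "U"})


def _code(ch, prev, nxt):
    """Step 1: code one letter, given its neighbours (None at the word edges)."""
    if ch in VOWELS:
        return "0"
    if ch == "H":
        return ""  # H has no code of its own
    if ch == "B":
        return "1"
    if ch == "P":
        return "3" if nxt == "H" else "1"
    if ch in "DT":
        return "8" if nxt in frozenset("CSZ") else "2"
    if ch in "FVW":
        return "3"
    if ch in "GKQ":
        return "4"
    if ch == "C":
        if prev is None:
            return "4" if nxt in C_INITIAL_HARD else "8"
        if prev in SIBILANTS:
            return "8"
        return "4" if nxt in C_INNER_HARD else "8"
    if ch == "X":
        return "8" if prev in frozenset("CKQ") else "48"
    if ch == "L":
        return "5"
    if ch in "MN":
        return "6"
    if ch == "R":
        return "7"
    return "8"  # S and Z


def cologne_phonetic(word):
    # Keep only A-Z after upper-casing; str.upper() turns "ß" into "SS".
    letters = [c for c in word.upper().translate(UMLAUTS) if "A" <= c <= "Z"]
    raw = "".join(
        _code(
            ch,
            letters[i - 1] if i > 0 else None,
            letters[i + 1] if i + 1 < len(letters) else None,
        )
        for i, ch in enumerate(letters)
    )
    # Step 2: collapse runs of the same digit.
    collapsed = "".join(d for i, d in enumerate(raw) if i == 0 or d != raw[i - 1])
    # Step 3: drop every "0" except a leading one.
    return collapsed[:1] + collapsed[1:].replace("0", "")


print(cologne_phonetic("deutsch"))
```

With these rules, "deutsch" is coded letter by letter as 200888, collapsed to 208, and reduced to 28, which matches Franz's manual walk-through rather than the 288 expected by the PHP unit test.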
