commons-issues mailing list archives

From "Sebb (JIRA)" <>
Subject [jira] [Commented] (CODEC-199) Bug in HW rule in Soundex
Date Fri, 31 Mar 2017 08:00:50 GMT


Sebb commented on CODEC-199:

bq. this is exactly a use case for where H and W are treated as vowels

No, it's not.
Vowels have the code '0' and are used to separate consonants that share the same non-zero code.
In this case, vowels are completely ignored, i.e. they are treated like H and W.

Try the following test:

        Assert.assertEquals("L150", s.encode("Lippmann"));

This fails with the current code (generates "L155") unless you set A to behave like HW, i.e.
vowels need to be set to '#' (silent).
Try it and see.

Step 3 of the Wikipedia definition says "two letters with the same number separated by 'h'
or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice".
However, the Genealogy definition implies that such letters are coded as a single number for
HW *and* the vowels.

The output from the Wikipedia definition allows repeated digits.
The Genealogy definition explicitly does not. That means it does not have any Wiki-style vowels;
for the Genealogy definition vowels + HW are all silent.
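
The difference between the two definitions can be illustrated with a small standalone sketch.
This is not the Commons Codec implementation: the class name SoundexSketch, the flag
vowelsSilent, and the use of '-' as an internal marker for H and W are all hypothetical
choices made for the example. With vowelsSilent=false a vowel resets the "last code", so a
repeated digit can appear (Wikipedia style); with vowelsSilent=true vowels are skipped just
like H and W (Genealogy style).

```java
final class SoundexSketch {
    // Index 0 = 'A' ... 25 = 'Z'. '0' marks a vowel (including Y);
    // '-' is a hypothetical marker for the letters H and W.
    private static final String CODES = "0123012-02245501262301-202";

    // Illustrative only: encodes a name to a 4-character Soundex-like code.
    static String encode(String name, boolean vowelsSilent) {
        StringBuilder out = new StringBuilder(4);
        char last = 0;
        for (char raw : name.toCharArray()) {
            char c = Character.toUpperCase(raw);
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not A-Z
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                last = code;
                continue;
            }
            if (code == '-') {
                continue; // H and W never separate equal codes
            }
            if (code == '0') {
                if (!vowelsSilent) {
                    last = '0'; // Wikipedia style: a vowel separates equal codes
                }
                continue; // Genealogy style: the vowel is skipped entirely
            }
            if (code != last) {
                out.append(code);
            }
            last = code;
            if (out.length() == 4) {
                break;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Lippmann", false)); // L155
        System.out.println(encode("Lippmann", true));  // L150
    }
}
```

In both modes "Ashcraft" comes out as "A261", because S and C are separated only by H;
the two modes diverge only when equal codes are separated by a vowel, as in "Lippmann".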

I have yet to see a definition that requires HW to be treated as a vowel rather than silent
(or a consonant).
If you find any examples, please provide links (and test cases if possible).

> Bug in HW rule in Soundex
> -------------------------
>                 Key: CODEC-199
>                 URL:
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Yossi Tamari
>             Fix For: 1.11
>         Attachments: better.patch, soundex.patch
> The Soundex algorithm says that if two characters that map to the same code are separated
> by H or W, the second one is not encoded.
> However, in the implementation (in Soundex.getMappingCode() line 191), a character that
> is preceded by two characters that are either H or W is not encoded, regardless of what the
> last consonant was.
> Source:
