commons-issues mailing list archives

From "Yossi Tamari (JIRA)" <>
Subject [jira] [Commented] (CODEC-199) Bug in HW rule in Soundex
Date Thu, 30 Mar 2017 16:29:41 GMT


Yossi Tamari commented on CODEC-199:

In my view, the fact that the HW rule is forced IS a bug, and it is one of the things I was
trying to fix.
Read []. While the class claims to implement Soundex,
it really implements American Soundex. I have no problem with that, but then we should change
the class name. I was under the impression that the constant (and the parameterized constructors)
were there so that we could say this is a generic Soundex implementation with
a default behavior of American Soundex (which is fine, since this is what most people want).
But this was not the case before my second patch: the HW rule is part of American Soundex
only, yet there was no way to disable it or to apply it to other letters.
The original default was wrong, and changing it is a feature. When somebody passes "01230120022455012623010202"
intentionally, the HW rule should be disabled. If you want to protect against the accidental
case, we can change the names of the constants, so the caching will not work. If you think
this change is too big for 1.11 and should only happen in 2.0, that is a fair argument, though
I personally don't think so at the moment (I am not sure what the rules are here for behavior
changes in a minor release).

> Bug in HW rule in Soundex
> -------------------------
>                 Key: CODEC-199
>                 URL:
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Yossi Tamari
>             Fix For: 1.11
>         Attachments: better.patch, soundex.patch
> The Soundex algorithm says that if two characters that map to the same code are separated
> by H or W, the second one is not encoded.
> However, in the implementation (in Soundex.getMappingCode() line 191), a character that
> is preceded by two characters that are either H or W, is not encoded, regardless of what the
> last consonant was.
> Source:
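
For illustration only, the rule described in the issue can be sketched as below. This is not the Commons Codec implementation and not either attached patch: the class name SoundexSketch, the method encode, and the applyHwRule flag are hypothetical names chosen for this example. The mapping string is the one quoted in the comment above, and the sketch assumes the usual four-character, zero-padded code.

```java
import java.util.Locale;

// Hypothetical illustration; not the Commons Codec Soundex class.
final class SoundexSketch {
    // Mapping string quoted in the comment: position 0 is 'A', position 25 is 'Z'.
    static final String MAPPING = "01230120022455012623010202";

    // Encodes a word to a 4-character code. When applyHwRule is true, H and W
    // are skipped without resetting the previous code, so two letters with the
    // same code separated only by H or W are encoded once.
    static String encode(String word, boolean applyHwRule) {
        String s = word.toUpperCase(Locale.ENGLISH).replaceAll("[^A-Z]", "");
        if (s.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder().append(s.charAt(0));
        char last = MAPPING.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (applyHwRule && (c == 'H' || c == 'W')) {
                continue; // transparent: the last consonant code is remembered
            }
            char code = MAPPING.charAt(c - 'A');
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // a '0' (vowel, or H/W with the rule off) acts as a separator
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }
}
```

With this sketch, encode("Ashcraft", true) gives A261 (the S and C collapse across the H), while encode("Ashcraft", false) gives A226, because H is then treated like any other letter mapped to 0.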
