commons-dev mailing list archives

From Chris Black <>
Subject [Codec] accented character soundex revisited
Date Wed, 15 Feb 2006 21:28:07 GMT
Over 18 months ago there was a thread on this list about the proper 
handling of accented characters in the Soundex encoder in commons-codec 
but it never seemed to get resolved. In addition, there are still 
failing unit tests that reference this issue in the current version of 
the code. As someone who uses this code, I'd like to see all unit tests 
passing, so I've done some investigation.
As a refresher, there were three options discussed for the behavior of 
the Soundex codec when it sees an accented character:
1) Throw an IllegalArgumentException
2) Drop it silently
3) Replace it with the equivalent unaccented character
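
The three options above can be sketched roughly as follows. This is an
illustration only, not code from commons-codec: the class and method
names are hypothetical, "accented" is assumed to mean any letter outside
A-Z, and option 3 relies on java.text.Normalizer, which is an assumption
about the available JDK.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch of the three policies; not part of commons-codec.
final class AccentPolicySketch {

    // Assumption: a "plain" letter is one of the 26 unaccented Latin letters.
    static boolean isPlainLetter(char ch) {
        return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
    }

    // Option 1: throw as soon as a letter outside A-Z is seen.
    static String rejectAccented(String str) {
        for (int i = 0; i < str.length(); i++) {
            char ch = str.charAt(i);
            if (Character.isLetter(ch) && !isPlainLetter(ch)) {
                throw new IllegalArgumentException("Not mappable: " + ch);
            }
        }
        return str.toUpperCase(Locale.ENGLISH);
    }

    // Option 2: silently drop everything that is not a plain letter.
    static String dropAccented(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char ch = str.charAt(i);
            if (isPlainLetter(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString().toUpperCase(Locale.ENGLISH);
    }

    // Option 3: decompose each character, then strip the combining marks.
    // Letters with no canonical decomposition pass through unchanged.
    static String stripAccents(String str) {
        String decomposed = Normalizer.normalize(str, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toUpperCase(Locale.ENGLISH);
    }
}
```

For an input such as "Renée", option 1 throws, option 2 leaves "RENE",
and option 3 leaves "RENEE", so the three choices can produce different
encodings for the same name.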

Right now the code drops it silently, but the unit tests are expecting 
an IllegalArgumentException. The code in ch) seems to 
be trying to throw this exception, but it will never happen because the 
characters passed to it from Soundex.soundex are from a String that has 
gone through SoundexUtils.clean(String str) which removes all characters 
that fail a Character.isCharacter(char ch) check (accented chars fail 
this check, I, erm, checked). This means if we want to throw an 
IllegalArgumentException it must be done in SoundexUtils.clean, not
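
To illustrate why the exception is unreachable, here is a minimal
sketch of a clean-then-map pipeline. It is not the commons-codec
source: the class and method names are hypothetical, and the filter in
clean is assumed to keep only A-Z, standing in for the check described
above. The mapping string is the usual US English Soundex table.

```java
// Hypothetical sketch of the control flow; not the commons-codec source.
final class CleanThenMapSketch {

    // Soundex digit for each of A..Z, in order.
    private static final String MAPPING = "01230120022455012623010202";

    // Assumed filter: upper-case each char and keep it only if it is A-Z.
    static String clean(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char ch = Character.toUpperCase(str.charAt(i));
            if (ch >= 'A' && ch <= 'Z') {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    // Throws for anything outside A-Z, like the check discussed above.
    static char mappingCode(char ch) {
        int index = ch - 'A';
        if (index < 0 || index >= MAPPING.length()) {
            throw new IllegalArgumentException("Not mappable: " + ch);
        }
        return MAPPING.charAt(index);
    }

    // Because clean runs first, mappingCode only ever sees A-Z here,
    // so its throw can never fire on this path.
    static String mapAll(String str) {
        String cleaned = clean(str);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cleaned.length(); i++) {
            sb.append(mappingCode(cleaned.charAt(i)));
        }
        return sb.toString();
    }

    // One way to get behavior 1: throw inside the cleaning step instead.
    static String strictClean(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char ch = Character.toUpperCase(str.charAt(i));
            if (ch >= 'A' && ch <= 'Z') {
                sb.append(ch);
            } else if (Character.isLetter(ch)) {
                throw new IllegalArgumentException("Not mappable: " + ch);
            }
        }
        return sb.toString();
    }
}
```

In this sketch mapAll("Renée") completes without an exception because
the accented letter is gone before mappingCode is called, while
strictClean("Renée") throws.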

I think either behavior 1 or behavior 2 (drop silently, which is what we 
currently do) would be easy to implement. The unit tests could then be 
changed to match, so that all unit tests in commons-codec pass.

If someone lets me know which behavior is desired I will submit a patch. 
Note that behavior 2 only requires either removing the test cases or 
changing them to expect the same encoding as an empty string.


