commons-dev mailing list archives

From "C. Scott Ananian" <>
Subject RE: [codec] Soudex issue with accented character.
Date Wed, 02 Jun 2004 15:52:10 GMT
On Wed, 2 Jun 2004, C. Scott Ananian wrote:
> On Wed, 2 Jun 2004, Edelson, Justin wrote:
> > Agreed, but that only addresses vowels, not (for example) N with tilde
> > or C with cedilla.
> And these are fundamentally different from, say, 'W' in what way exactly?

My point being that soundex is a hashing and summarizing algorithm:
omitting characters does not break the algorithm, it just means that the
set of words which "sound like" the given one is slightly larger.
For example, with the n-with-tilde dropped, "pi~nata" and "pout" would both
soundex to P300.  But misspelling "pi~nata" as "pi~nada" would still yield
the correct match, since t and d share a code.

Trying to figure out which letters a given "funny" character "sounds like"
is misguided (IMO) -- c-with-cedilla sounds like s, not c. (Although
soundex represents both s and c with the same code.)  N-with-tilde sounds
like "ny" (although soundex drops the y).  I'm sure there are other examples.
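
To make the dropping argument concrete, here is a minimal Python sketch of classic American Soundex in which every character outside A-Z is simply skipped. It is a standalone illustration, not the Commons Codec implementation; the names CODES and soundex are made up for this example, and the test words use real Unicode letters where this message writes pi~nata and na"ive.

```python
import string

# Classic American Soundex letter groups. Vowels, Y, H and W carry no code.
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code, dropping anything that is not A-Z."""
    # Assumption of this sketch: accented letters are dropped, not mapped.
    letters = [ch.upper() for ch in word if ch in string.ascii_letters]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


print(soundex("piñata"))   # P300 (the ñ is dropped)
print(soundex("pout"))     # P300
print(soundex("piñada"))   # P300
print(soundex("pinata"))   # P530
```

Dropping the character widens the match set (piñata collides with pout) but keeps the misspelling piñada in the same bucket, while piñata and pinata end up with different codes.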

However, I will admit that, since soundex is English-oriented, and English
tends to strip accents from imported words, it might be nice to not
only have na"ive and naive map to the same code (which dropping the
accented character still accomplishes) but also pi~nata and pinata
and francois and fran,cois.
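
If one did want the accent-stripping behaviour, one possible sketch in Python is to decompose each character (Unicode NFD) and discard the combining marks before the word reaches soundex. The helper name strip_accents is made up for this example, and this is only an assumption about how such a preprocessing step could look. Note that it mimics English-style stripping (c-with-cedilla becomes c), not pronunciation.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks so accented letters fall back to their base letter."""
    # NFD splits e.g. "ñ" into "n" followed by a combining tilde (U+0303).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("piñata"))    # pinata
print(strip_accents("françois"))  # francois
print(strip_accents("naïve"))     # naive
```

Letters that have no canonical decomposition (for example ø) pass through unchanged, so a soundex that drops non-ASCII characters would still be needed as the fallback.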

So you could make an argument either way, but dropping the character is
not inconsistent with soundex, easier to describe in documentation, and
much easier to implement reliably (i.e., covering all possible 'funny'

