commons-dev mailing list archives

From "Gary Gregory" <ggreg...@seagullsw.com>
Subject RE: [codec] Soudex issue with accented character.
Date Wed, 02 Jun 2004 17:23:25 GMT
I agree that behavior should be consistent; we just need to define what this behavior is. 

Whoever initially created DoubleMetaphone went to great pains to handle some of these characters
(from the original C source?) and it seems wrong to take out support for "ç".

From reading this thread and looking at the Soundex source, I think:

For Soundex you can plug in a mapping, which means the behavior for *un*mapped characters needs
to be defined. Today that behavior is an IAE (IllegalArgumentException), which seems reasonable
since it tells you that your input cannot be processed the way you expected. If you want more
characters, you provide your own mapping. The issue then becomes: is our current way of plugging
in a mapping reasonable? I believe this question can be addressed post 1.3 if you buy into my
reasoning.
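
To make the idea concrete, here is a minimal sketch of a Soundex-style encoder with a pluggable mapping that rejects unmapped characters. It is not the [codec] source: the class name MappedSoundexSketch, its methods, and the Map-based mapping are assumptions made for illustration, and the encoding is simplified (it skips vowels and collapses adjacent duplicate codes, but does not implement the special H/W rule).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the Commons Codec Soundex class.
class MappedSoundexSketch {
    // Standard US English Soundex codes for A..Z ('0' means "not coded").
    static final String US_ENGLISH = "01230120022455012623010202";

    private final Map<Character, Character> mapping;

    MappedSoundexSketch(Map<Character, Character> mapping) {
        this.mapping = mapping;
    }

    // Builds the default A..Z mapping; callers may add entries to it.
    static Map<Character, Character> usEnglishMapping() {
        Map<Character, Character> m = new HashMap<>();
        for (int i = 0; i < US_ENGLISH.length(); i++) {
            m.put((char) ('A' + i), US_ENGLISH.charAt(i));
        }
        return m;
    }

    // Looks up the code for one character; unmapped input is an error.
    private char code(char c) {
        Character digit = mapping.get(c);
        if (digit == null) {
            throw new IllegalArgumentException("The character is not mapped: " + c);
        }
        return digit;
    }

    // Simplified Soundex: first letter plus up to three digits, zero padded.
    String encode(String input) {
        String s = input.toUpperCase(Locale.ROOT);
        if (s.isEmpty()) {
            return s;
        }
        StringBuilder out = new StringBuilder();
        out.append(s.charAt(0));
        char last = code(s.charAt(0));
        // Every character is validated, even after the code is full.
        for (int i = 1; i < s.length(); i++) {
            char digit = code(s.charAt(i));
            if (digit != '0' && digit != last && out.length() < 4) {
                out.append(digit);
            }
            last = digit;
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }
}
```

With the default mapping, encode("Robert") gives "R163" and encode("François") throws an IllegalArgumentException. A caller who wants "ç" handled adds it to the mapping (for example, mapping 'Ç' to '2', the same code as 'C'), after which encode("François") gives "F652".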

Not sure about the other encodings...

Gary 

> -----Original Message-----
> From: Edelson, Justin [mailto:Justin.Edelson@mtvi.com]
> Sent: Wednesday, June 02, 2004 07:56
> To: Jakarta Commons Developers List
> Subject: RE: [codec] Soudex issue with accented character.
> 
> The only "better" solution I can think of is to map the characters into
> their non-accented equivalent. While I think it's important to state that
> the default Soundex implementation is for English words, it would be nice
> to accommodate words with accented characters.
> 
> My bigger concern is that the behavior is inconsistent between Soundex,
> Metaphone, & DoubleMetaphone. Soundex will now throw an
> IllegalArgumentException, whereas Metaphone passes through the "bad"
> character. DoubleMetaphone has support for two accented characters, C with
> cedilla and N with tilde.
> 
> To the extent that I think the language codecs should be swappable
> components, it's a good idea for the support to be consistent. To that
> end, a String passed to any of the codecs should either throw an exception
> for all or none.
> 
> Just my 2 cents.
> 
> 
> -----Original Message-----
> From: Gary Gregory [mailto:ggregory@seagullsw.com]
> Sent: Sunday, May 23, 2004 8:37 PM
> To: Jakarta Commons Developers List
> Subject: [codec] Soudex issue with accented character.
> 
> 
> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=29080
> 
> Currently, "ö" or "é" in a String causes Soundex to throw an
> ArrayIndexOutOfBoundsException.
> 
> We can either:
> 
> (1) Throw a better Exception, like IllegalArgumentException: Only 'plain'
> letters are allowed.
> 
> Or:
> 
> (2) Ignore unmapped characters. This would work for "ö" and "é" since
> vowels are ignored but this could cause bad encoding values for other
> chars like "ç".
> 
> AFAIK, you cannot ask if a character is a vowel or not.
> 
> Thoughts?
> 
> Gary
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org