commons-issues mailing list archives

From Cédrik LIME (JIRA) <j...@apache.org>
Subject [jira] Commented: (LANG-285) Wish : method unaccent
Date Wed, 20 Feb 2008 14:27:44 GMT

    [ https://issues.apache.org/jira/browse/LANG-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12570701#action_12570701 ]

Cédrik LIME commented on LANG-285:
----------------------------------

Here is a pure Unicode version, which complements the "big list of accented characters"
approach described previously.
One could probably use reflection to detect which implementation is available, and fall
back to the "search and replace" method if none of them is.
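The reflection idea could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name UnaccentDispatch, the method names, and the tiny replacement table are hypothetical, only the Java 6+ implementation is probed, and the code uses modern Java syntax (generics, varargs), so it would need adapting to compile on a 1.3/1.4 compiler.

```java
import java.lang.reflect.Method;
import java.util.regex.Pattern;

/** Hypothetical helper: picks an unaccent strategy at class-load time. */
final class UnaccentDispatch {

    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    /** Reflective handle on java.text.Normalizer.normalize, or null if unusable. */
    private static final Method NORMALIZE;
    /** The java.text.Normalizer.Form.NFD constant, or null if unusable. */
    private static final Object NFD;

    static {
        Method normalize = null;
        Object nfd = null;
        try {
            // The nested Form enum only exists in the public Java 6+ API.
            Class<?> form = Class.forName("java.text.Normalizer$Form");
            nfd = form.getField("NFD").get(null);
            normalize = Class.forName("java.text.Normalizer")
                    .getMethod("normalize", CharSequence.class, form);
        } catch (Exception e) {
            // Class, field or method missing: keep both null and use the fallback.
            normalize = null;
            nfd = null;
        }
        NORMALIZE = normalize;
        NFD = nfd;
    }

    static String unaccent(String text) {
        if (NORMALIZE != null) {
            try {
                String decomposed = (String) NORMALIZE.invoke(null, text, NFD);
                return MARKS.matcher(decomposed).replaceAll("");
            } catch (Exception e) {
                // Reflective call failed: fall through to the fallback.
            }
        }
        return searchAndReplace(text);
    }

    /** Stand-in for the "big list" approach; covers only a few characters. */
    static String searchAndReplace(String text) {
        return text.replace('\u00e9', 'e').replace('\u00e8', 'e')
                .replace('\u00ea', 'e').replace('\u00e0', 'a')
                .replace('\u00f9', 'u').replace('\u00fb', 'u')
                .replace('\u00f4', 'o').replace('\u00ee', 'i');
    }
}
```

A real version would also probe for the SUN internal class and ICU4J in the same way, and would replace searchAndReplace with the full character list.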


Java 6+ (beware: a java.text.Normalizer class exists in Java 1.3, but it is incompatible!):

private static final Pattern sunPattern =
        Pattern.compile("\\p{InCombiningDiacriticalMarks}+");//$NON-NLS-1$

// Decompose to NFD so each accent becomes a separate combining mark, then strip the marks.
String decomposed = java.text.Normalizer.normalize(string, java.text.Normalizer.Form.NFD);
return sunPattern.matcher(decomposed).replaceAll("");//$NON-NLS-1$
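
For reference, here is the Java 6+ variant as a small self-contained class, applied to the sentence from the issue description. The class name NormalizerUnaccentDemo is made up for this sketch, and the accented letters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

/** Hypothetical demo class wrapping the Java 6+ snippet above. */
final class NormalizerUnaccentDemo {

    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    /** Decomposes to NFD, then removes the combining diacritical marks. */
    static String unaccent(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // The example sentence from LANG-285, with accents as escapes.
        String input = "L'\u00e9t\u00e9 o\u00f9 j'ai d\u00fb aller \u00e0 "
                + "l'\u00eele d'Anticosti commenca t\u00f4t";
        System.out.println(unaccent(input));

        // U+00F8 (o with stroke) has no canonical decomposition,
        // so this approach leaves it unchanged.
        System.out.println(unaccent("\u00f8").equals("\u00f8"));
    }
}
```

Letters that have no canonical decomposition, such as the o with stroke above, pass through untouched, which is why this version complements the character list rather than replacing it.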



SUN internal, Java 1.3 to 1.5:

private static final Pattern sunPattern =
Pattern.compile("\\p{InCombiningDiacriticalMarks}+");//$NON-NLS-1$

String result = sun.text.Normalizer.decompose(text, false, 0);
result = sunPattern.matcher(result).replaceAll("");//$NON-NLS-1$



IBM ICU4J (http://www.icu-project.org/):

private static final com.ibm.icu.text.Transliterator accentsRemover =
Transliterator.getInstance("NFD; [:Nonspacing Mark:] Remove; NFC; ");//$NON-NLS-1$

return accentsRemover.transliterate(text);

> Wish : method unaccent
> ----------------------
>
>                 Key: LANG-285
>                 URL: https://issues.apache.org/jira/browse/LANG-285
>             Project: Commons Lang
>          Issue Type: New Feature
>            Reporter: Guillaume Coté
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: MapBuilder.java, unaccent.patch, UnnacentMap.java
>
>
> I would like to add a method that replaces accented characters with unaccented ones.  For example,
> with the input String "L'été où j'ai dû aller à l'île d'Anticosti commenca tôt", the
> method would return "L'ete ou j'ai du aller a l'ile d'Anticosti commenca tot".
> I suggest calling that method unaccent and adding it to StringUtils.
> If we cannot cover all cases, the first version could cover only ISO-8859-1.
> If you are willing to go forward with that idea, I am willing to contribute a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

