lucene-java-user mailing list archives

From Otis Gospodnetic <>
Subject Re: (Offtopic) The unicode name for a character
Date Wed, 22 Dec 2004 16:48:43 GMT
If you are not tied to Java, see 'unac' at
It's old, but if nothing else you could see how it works and rewrite it
in Java.  If you do, you could donate it to the Lucene Sandbox.


--- Peter Pimley <> wrote:

> Hi everyone,
>
> The Question:
> In Java generally, is there an easy way to get the Unicode name of a
> character?  (e.g. "LATIN SMALL LETTER A" from 'a')
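
For readers on a newer JDK, the lookup asked about above can be sketched as follows. This assumes Java 7 or later, where `Character.getName(int)` was added; it did not exist when this thread was written. The class name `UnicodeNameLookup` and the helper `nameOf` are hypothetical names chosen for illustration.

```java
final class UnicodeNameLookup {

    /**
     * Returns the Unicode name of the given code point,
     * or null if the code point is unassigned.
     * Assumes Java 7+ (Character.getName was added in that release).
     */
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(nameOf('a'));     // LATIN SMALL LETTER A
        System.out.println(nameOf(0x00E9));  // LATIN SMALL LETTER E WITH ACUTE
    }
}
```

On older JDKs, ICU4J offers a comparable lookup through `UCharacter.getName(int)`, which may be the easier route if ICU4J is already on the classpath.
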
> The Reasoning (for those who are interested):
> The documents I'm indexing have quite a lot of characters that are 
> basically variations on the basic A-Z ones.  In my analysis step, I'd
> like to convert these to their closest equivalent in the basic A-Z
> set.
> For some letters, this is easy.  An example is the e-acute character
> (00E9 LATIN SMALL LETTER E WITH ACUTE).  I'd like to turn that into
> plain 'e'.  I can do that by using the IBM ICU4J tools to decompose
> the single character into two: 'e' and 0301 COMBINING ACUTE ACCENT.
> Then I can strip all characters that fail Character.isLetterOrDigit.
> That works fine.
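
The decompose-then-strip step described above can be sketched roughly like this. Note the substitution: the message uses ICU4J, while this sketch uses the JDK's own `java.text.Normalizer` (Java 6+) so that it runs without extra libraries. The class name `AccentStripper` and the method `stripToBase` are hypothetical.

```java
import java.text.Normalizer;

final class AccentStripper {

    /**
     * Decomposes the input (NFD) and keeps only the code points that pass
     * Character.isLetterOrDigit, which drops combining marks such as U+0301.
     */
    static String stripToBase(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripToBase("caf\u00E9")); // cafe
    }
}
```

Because the filter is `isLetterOrDigit`, whitespace and punctuation are dropped too, so this is best applied per token inside the analyzer rather than to whole text. A character with no decomposition, such as U+01A4, passes through unchanged.
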
> Some characters however do not decompose.  An example is the
> character 01A4 LATIN CAPITAL LETTER P WITH HOOK.  I'd like to replace
> that with 'P', but it does not decompose into P + something.
> I'm considering taking the Unicode name for each character I
> encounter and regexping it against something like:
> ^LATIN .* LETTER (.) WITH .*$
> ... to try and extract the single A-Z|a-z character.
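
A minimal sketch of the name-matching idea above, assuming Java 7+ for `Character.getName(int)`. The regular expression is the one from the message. Since Unicode names are all upper case, the captured letter is lower-cased when the name starts with "LATIN SMALL"; that step is an addition not spelled out in the message. The class name `NameBasedFolder` and the method `fold` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NameBasedFolder {

    // Pattern taken verbatim from the message.
    private static final Pattern LATIN_WITH =
            Pattern.compile("^LATIN .* LETTER (.) WITH .*$");

    /**
     * Returns the plain A-Z or a-z letter named inside the code point's
     * Unicode name, or the code point itself if the name does not match.
     */
    static int fold(int codePoint) {
        String name = Character.getName(codePoint);
        if (name == null) {
            return codePoint; // unassigned code point
        }
        Matcher m = LATIN_WITH.matcher(name);
        if (!m.matches()) {
            return codePoint;
        }
        char base = m.group(1).charAt(0);
        if (base < 'A' || base > 'Z') {
            return codePoint; // captured something other than a basic letter
        }
        // Names are upper case, so restore lower case for small letters.
        return name.startsWith("LATIN SMALL ") ? Character.toLowerCase(base) : base;
    }

    public static void main(String[] args) {
        System.out.println((char) fold(0x01A4)); // P
        System.out.println((char) fold(0x00E9)); // e
    }
}
```

Names with a multi-letter base, such as ligatures, do not match the single-character group and are returned unchanged, so they would still need separate handling.
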
