lucene-java-user mailing list archives

From Peter Pimley <ppim...@semantico.com>
Subject (Offtopic) The unicode name for a character
Date Wed, 22 Dec 2004 10:52:27 GMT

Hi everyone,

The Question:
In Java generally, is there an easy way to get the Unicode name of a
character?  (e.g. "LATIN SMALL LETTER A" from 'a')
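
For anyone on Java 7 or later, the standard library appears to cover
this directly through Character.getName(int). The sketch below assumes
such a JDK; the class and method names (UnicodeNameDemo, nameOf) are
made up for illustration. On older JDKs, ICU4J's UCharacter.getName(int)
may be worth a look instead.

```java
// A minimal sketch, assuming Java 7+ where Character.getName(int) exists.
// UnicodeNameDemo and nameOf are hypothetical names used for illustration.
class UnicodeNameDemo {

    /** Returns the Unicode name of a code point, or null if it is unassigned. */
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(nameOf('a'));     // LATIN SMALL LETTER A
        System.out.println(nameOf(0x00E9));  // LATIN SMALL LETTER E WITH ACUTE
    }
}
```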


The Reasoning (for those who are interested):
The documents I'm indexing have quite a lot of characters that are 
basically variations on the basic A-Z ones.  In my analysis step, I'd 
like to convert these to their closest equivalent in the basic A-Z set.

For some letters, this is easy.  An example is the e-acute character 
(00E9 LATIN SMALL LETTER E WITH ACUTE).  I'd like to turn that into 
plain 'e'.  I can do that by using the IBM ICU4J tools to decompose the
single character into two: 'e' and 0301 COMBINING ACUTE ACCENT.  Then I
can strip every character that fails Character.isLetterOrDigit, which
removes the combining accent and leaves the plain 'e'.  That works fine.
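
The decompose-then-strip step can be sketched roughly as follows. Note
the swap: this uses java.text.Normalizer from the JDK (Java 6+) rather
than ICU4J, purely to keep the example dependency-free. The names
AccentStripper and stripAccents are hypothetical. Because the filter
drops everything that is not a letter or digit, it is meant to be
applied to individual tokens, not to text containing spaces or
punctuation.

```java
import java.text.Normalizer;

// A minimal sketch, assuming Java 6+ and java.text.Normalizer in place of
// the ICU4J decomposition described above. Names here are hypothetical.
class AccentStripper {

    /** Decomposes to NFD, then keeps only code points that are letters or digits. */
    static String stripAccents(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            // Combining marks such as U+0301 are not letters or digits, so they drop out.
            if (Character.isLetterOrDigit(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this, stripAccents("\u00E9") yields "e", while "\u01A4" comes back
unchanged because it has no decomposition.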

Some characters, however, do not decompose.  An example is the character
01A4 LATIN CAPITAL LETTER P WITH HOOK.  I'd like to replace that with
'P', but it does not decompose into P + something.

I'm considering taking the Unicode name of each character I encounter
and matching it against a regular expression something like:
^LATIN .* LETTER (.) WITH .*$
... to try and extract the single A-Z|a-z character.
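
A rough sketch of that name-matching idea, assuming Java 7+ for
Character.getName(int). The class and method names (LatinBaseLetter,
baseLetter) are hypothetical. Unicode names are always upper case, so
the sketch takes the case from the original character instead of from
the name; that detail is my assumption, not part of the idea above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal sketch, assuming Java 7+ (Character.getName). Names are hypothetical.
class LatinBaseLetter {

    // The pattern from the message: one character between "LETTER " and " WITH".
    private static final Pattern NAME =
            Pattern.compile("^LATIN .* LETTER (.) WITH .*$");

    /** Returns the base A-Z/a-z letter for a code point, or the code point itself if none is found. */
    static int baseLetter(int codePoint) {
        String name = Character.getName(codePoint);
        if (name == null) {
            return codePoint; // unassigned code point: leave it alone
        }
        Matcher m = NAME.matcher(name);
        if (!m.matches()) {
            return codePoint; // e.g. plain 'a', or names like "... LETTER AE WITH ACUTE"
        }
        char letter = m.group(1).charAt(0); // always upper case in Unicode names
        // Assumption: preserve the case of the original character.
        return Character.isLowerCase(codePoint) ? Character.toLowerCase(letter) : letter;
    }
}
```

For instance, baseLetter(0x01A4) gives 'P' and baseLetter(0x00E9) gives
'e'; characters whose names do not fit the pattern are returned as-is.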



