lucene-java-user mailing list archives

From "Binkley, Peter"
Subject RE: UTF8 accents & umlauts filter?
Date Thu, 14 Sep 2006 17:45:59 GMT
We use ICU4J to do the filtering based on Unicode blocks. See for a sense of what
you can do. It's worth it for us because we need to normalize Cyrillic
as well as Roman text; it might be overkill for other situations. But it
does good work. The first example on the page linked above shows
accent-stripping: you normalize to NFD (decomposed Unicode, where
accents are represented as non-spacing characters), then delete all the
non-spacing characters, and finally normalize back to composed Unicode
(NFC).
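
The three steps described above can be sketched with the JDK's own
java.text.Normalizer (Java 6 and later) standing in for ICU4J. That
substitution is an assumption made to keep the example free of external
dependencies, and the class name AccentStripper and the method strip are
hypothetical, not part of Lucene or ICU4J. Treat it as a minimal sketch
rather than a tuned filter:

```java
import java.text.Normalizer;

// Hypothetical helper illustrating the NFD -> strip marks -> NFC recipe.
final class AccentStripper {
    private AccentStripper() {}

    static String strip(String input) {
        // Step 1: decompose, so that e.g. U+00E9 becomes 'e' + U+0301.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: delete every non-spacing mark (Unicode general category Mn).
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // Step 3: recompose whatever is left.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // \u00e9 is e with acute accent; written as an escape so the
        // source file's encoding does not matter.
        System.out.println(strip("caf\u00e9")); // prints "cafe"
    }
}
```

The same call also handles Cyrillic: U+0439 decomposes to U+0438 plus a
combining breve, so it comes back as U+0438. Characters that have no
canonical decomposition, such as U+00F8 (o with stroke) or U+00DF (sharp
s), pass through unchanged, so a complete filter would likely still need
a small mapping table for them.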


-----Original Message-----
From: Michael Imbeault
Sent: Wednesday, September 13, 2006 9:34 PM
Subject: Re: UTF8 accents & umlauts filter?

Thanks Yonik & Ken for both answers; I think the explanations went a
little over my head, but I think you understood what I was talking
about! Basically, I'm after a better filter to remove all possible
accents (& umlauts as a bonus, for completeness' sake; I personally
would have no use for it).

I think it's way more work and way more complicated than I initially
thought it would be. Does anyone feel able to do this?

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212

Yonik Seeley wrote:
> Thanks for the links Michael... this one does look interesting:
> The challenge would be to make it fast... perhaps a custom hash table,
> or look into the cost of a perfect hash function.
> Just to clear up some Unicode/terminology issues:
> There are Latin-1 characters (the actual glyphs) represented by Unicode
> code points 0->255. There is also a Latin-1 encoding for Unicode (which
> can only represent Unicode code points 0->255).
> UTF-8 is another encoding for Unicode characters (or code points), but
> that's not really relevant to a filter.
> So ISOLatin1AccentFilter removes accents from characters <= 255, and
> it doesn't matter what the original encoding was (ASCII, Latin-1, UTF-8,
> UTF-16, etc.).
> -Yonik
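
Yonik's point that a filter sees code points rather than bytes can be
illustrated with a small sketch: the same character arrives as different
byte sequences under Latin-1 and UTF-8, but once decoded into a Java
String it is the same code point either way. The class and method names
below are hypothetical and exist only for this illustration:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: two encodings, one code point after decoding.
final class EncodingVsCodePoint {
    private EncodingVsCodePoint() {}

    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};             // e-acute in ISO-8859-1
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9};  // e-acute in UTF-8

        String a = decodeLatin1(latin1);
        String b = decodeUtf8(utf8);

        System.out.println(a.equals(b));                           // prints "true"
        System.out.println(Integer.toHexString(a.codePointAt(0))); // prints "e9"
    }
}
```

By the time a token filter runs, the text has already been decoded, so
the limitation of ISOLatin1AccentFilter is the range of code points it
maps, not the encoding the document was stored in.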
> On 9/12/06, Michael Imbeault wrote:
>> Right now Lucene has an accent filter (ISOLatin1AccentFilter) that
>> removes accents from ISO-8859-1 text. What about a UTF8AccentFilter?
>> Are there plans to add such a filter? It would be very useful, as
>> ISOLatin1AccentFilter isn't able to remove some complex accents in
>> some languages encoded in UTF-8 (I would paste examples, but I'm not
>> sure that they would display correctly). I think I saw a post long
>> ago on this mailing list about something like that, but it has never
>> been released officially.
>> See
>> 2001, first post about utf8 accents:
>> ing=accent;#648
>> 2004, a good solution, but still incomplete :
>> tring=accent;#10792
>> 2006, best attempt yet, but sadly undelivered :
>> tring=accent;#32142
>> I think Lucene would benefit from a complete UTF-8 accent remover...
>> right now the best solution I have is to process everything in PHP
>> before indexing and at query time (and it's a little slow).
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
