lucene-java-user mailing list archives

From stephane vaucher <vauc...@LUB.UMontreal.CA>
Subject Re: Accentuated characters
Date Thu, 12 Dec 2002 17:01:43 GMT
Actually, I'm just looking to strip accents from Java chars (so Unicode),
and only for the search: the original doc should stay the same, since I
display it. I'll just implement a TokenFilter to do this. It should be
relatively simple. I just wanted to know if it had already been done
(perhaps in a generic way).
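
For illustration, the per-token step of such a filter could be sketched as
below. This is only a sketch under assumptions: it relies on
java.text.Normalizer, which was added to the JDK in Java 6 (after this
thread), and the class and method names (AccentStripper, stripAccents) are
hypothetical, not part of Lucene. The Lucene TokenFilter wrapper itself is
omitted so the example stays free of dependencies.

```java
import java.text.Normalizer;

// Hypothetical helper: the name and shape are assumptions for illustration.
public final class AccentStripper {
    private AccentStripper() {}

    /** Returns the text with combining marks removed after canonical decomposition. */
    public static String stripAccents(String text) {
        // NFD splits e-acute (U+00E9) into 'e' plus U+0301 COMBINING ACUTE ACCENT.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches Unicode combining marks; dropping them leaves the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of its encoding.
        System.out.println(stripAccents("caf\u00e9 r\u00e9sum\u00e9")); // prints "cafe resume"
    }
}
```

Only the token text seen by the index and the query is changed this way, so
the stored document can still be displayed as written. Characters with no
canonical decomposition (for example U+00F8, o with stroke) pass through
unchanged and would need an explicit mapping.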


Joshua O'Madadhain wrote:

>On Tue, 10 Dec 2002, stephane vaucher wrote:
>>I wish to implement a TokenFilter that will remove accents from
>>characters, so that, for example, 'é' becomes 'e'. As I would rather not
>>reinvent the wheel, I've tried to find something on the web and on the
>>mailing lists. I saw a mention of a contrib that could do this (see
>>but I don't see anything applicable.
>It may depend on what kind of encoding you're working with.  (E.g., HTML
>documents represent such characters in a different way than that of
>Postscript documents.)  Probably the easiest way to handle this, if you
>want to avoid such questions, would be to convert all your input documents
>(and query text) to Java (Unicode) strings, and then do a
>search-and-replace with the appropriate character-pair arguments.  (After
>this is done, you would then do whatever Lucene processing (indexing,
>query parsing, etc.) was appropriate.)  I am not aware of any code that
>does this, but it should be straightforward.
>Good luck--
>Joshua O'Madadhain
>  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
> It's that moment of dawning comprehension that I live for--Bill Watterson
>My opinions are too rational and insightful to be those of any organization.
>To unsubscribe, e-mail:   <>
>For additional commands, e-mail: <>
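
The search-and-replace with character pairs suggested in the quoted message
could look roughly like the following. It is a minimal sketch, not code from
Lucene or from either poster: the class name AccentTable, the method fold,
and the (deliberately incomplete) list of pairs are all assumptions chosen
for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup table of (accented, plain) character pairs.
public final class AccentTable {
    private static final Map<Character, Character> PAIRS = new HashMap<>();

    static {
        // A small, incomplete sample; a real table would cover more characters.
        PAIRS.put('\u00e0', 'a'); // a with grave
        PAIRS.put('\u00e2', 'a'); // a with circumflex
        PAIRS.put('\u00e7', 'c'); // c with cedilla
        PAIRS.put('\u00e8', 'e'); // e with grave
        PAIRS.put('\u00e9', 'e'); // e with acute
        PAIRS.put('\u00ea', 'e'); // e with circumflex
        PAIRS.put('\u00f4', 'o'); // o with circumflex
        PAIRS.put('\u00f9', 'u'); // u with grave
    }

    private AccentTable() {}

    /** Replaces each character found in the table; all others are copied as-is. */
    public static String fold(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character plain = PAIRS.get(c);
            out.append(plain != null ? plain.charValue() : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The same folding is applied to the document text and to the query text.
        String indexed = fold("d\u00e9j\u00e0 vu");
        String query = fold("deja vu");
        System.out.println(indexed.equals(query)); // prints "true"
    }
}
```

Folding has to be applied identically at index time and at query time,
otherwise an unaccented query will not match the folded terms.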
