lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joshua O'Madadhain" <>
Subject Re: Accentuated characters
Date Tue, 10 Dec 2002 20:56:34 GMT
On Tue, 10 Dec 2002, stephane vaucher wrote:

> I wish to implement a TokenFilter that will remove accentuated
> characters so for example '' will become 'e'. As I would rather not
> reinvent the wheel, I've tried to find something on the web and on the
> mailing lists. I saw a mention of a contrib that could do this (see
> but I don't see anything applicable.

It may depend on what kind of encoding you're working with.  (E.g., HTML
documents represent such characters in a different way than that of
Postscript documents.)  Probably the easiest way to handle this, if you
want to avoid such questions, would be to convert all your input documents
(and query text) to Java (Unicode) strings, and then do a
search-and-replace with the appropriate character-pair arguments.  (After
this is done, you would then do whatever Lucene processing (indexing,
query parsing, etc.) was appropriate.  I am not aware of any code that
does this, but it should be straightforward.
Good luck--

Joshua O'Madadhain Per
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.

To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>

View raw message