Date: Tue, 10 Dec 2002 12:56:34 -0800 (Pacific Standard Time)
From: "Joshua O'Madadhain"
To: Lucene Users List
Subject: Re: Accentuated characters

On Tue, 10 Dec 2002, stephane vaucher wrote:

> I wish to implement a TokenFilter that will remove accentuated
> characters so for example 'é' will become 'e'. As I would rather not
> reinvent the wheel, I've tried to find something on the web and on the
> mailing lists. I saw a mention of a contrib that could do this (see
> http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html),
> but I don't see anything applicable.
It may depend on what kind of encoding you're working with. (E.g., HTML documents represent such characters differently from PostScript documents.) Probably the easiest way to handle this, if you want to avoid such questions, would be to convert all your input documents (and query text) to Java (Unicode) strings, and then do a search-and-replace with the appropriate character-pair arguments, e.g. replacing 'é' with 'e'. After this is done, you would then do whatever Lucene processing (indexing, query parsing, etc.) is appropriate. I am not aware of any code that does this, but it should be straightforward.

Good luck--

Joshua O'Madadhain

jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
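
The search-and-replace idea in the reply above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name AccentStripper and its strip method are hypothetical (they are not part of Lucene), the text is assumed to be already decoded into a Java String, and the character-pair table is deliberately partial and would need extending for a given language.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not a Lucene class): replaces accented letters via a lookup table. */
final class AccentStripper {
    // Character pairs, accented -> plain. Partial on purpose; extend as needed.
    private static final Map<Character, Character> PAIRS = new HashMap<>();

    static {
        // Unicode escapes keep the source ASCII-only, so the compiler's
        // source encoding setting cannot corrupt the table.
        String accented = "\u00e0\u00e2\u00e4\u00e7\u00e8\u00e9\u00ea\u00eb"
                        + "\u00ee\u00ef\u00f4\u00f6\u00f9\u00fb\u00fc";
        String plain    = "aaaceeeeiioouuu";
        for (int i = 0; i < accented.length(); i++) {
            PAIRS.put(accented.charAt(i), plain.charAt(i));
        }
    }

    private AccentStripper() {
    }

    /** Returns a copy of text with every character found in the table replaced. */
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Look up the lowercase form so one table covers both cases.
            Character replacement = PAIRS.get(Character.toLowerCase(c));
            if (replacement == null) {
                out.append(c); // not in the table: keep the character unchanged
            } else if (Character.isUpperCase(c)) {
                out.append(Character.toUpperCase(replacement.charValue()));
            } else {
                out.append(replacement.charValue());
            }
        }
        return out.toString();
    }
}
```

The same call, for example AccentStripper.strip(text), would have to be applied both to document text before indexing and to query text before parsing; otherwise indexed terms and query terms would no longer match.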