Message-ID: <3DF8C360.2090100@lub.umontreal.ca>
Date: Thu, 12 Dec 2002 12:12:00 -0500
From: stephane vaucher
To: Lucene Users List
Subject: Re: Accentuated characters
References: <187D6D956106D84E9D8B280F6458FE140EFC49@merc12.na.sas.com>

There is no problem with package scopes:

This is how I remove trailing 's' chars:

    String word = token.termText();
    if (word.endsWith("s")) {
        word = word.substring(0, word.length() - 1);
    }
    if (!word.equals(token.termText())) {
        return new Token(word, token.startOffset(), token.endOffset(), token.type());
    }

I'll take a look at how the Collator works to see if I can make a generic (maybe locale specific) string normaliser so I could specify the level of differences.

Stephane

Eric Isakson wrote:

>If you really want to make your own TokenFilter, have a look at org.apache.lucene.analysis.LowerCaseFilter.next()
>
>it does:
>  public final Token next() throws java.io.IOException {
>    Token t = input.next();
>
>    if (t == null)
>      return null;
>
>    t.termText = t.termText.toLowerCase();
>
>    return t;
>  }
>
>The termText member of the Token class is package scoped, so you will have to implement your filter in the org.apache.lucene.analysis package. No worries about encoding as the termText is already a java (unicode) string. You will just have to provide the mechanism to get the accented characters converted to their non-accented equivalents. java.text.Collator has some magic that does this for string comparisons but I couldn't find any public methods that give you access to convert a string to its non-accented equivalent.
>
>Eric
>--
>Eric D. Isakson        SAS Institute Inc.
>Application Developer  SAS Campus Drive
>XML Technologies       Cary, NC 27513
>(919) 531-3639         http://www.sas.com
>
>
>
>-----Original Message-----
>From: stephane vaucher [mailto:vaucher@LUB.UMontreal.CA]
>Sent: Tuesday, December 10, 2002 2:58 PM
>To: lucene-user@jakarta.apache.org
>Subject: Accentuated characters
>
>
>Hello everyone,
>
>I wish to implement a TokenFilter that will remove accentuated
>characters so for example 'é' will become 'e'. As I would rather not
>reinvent the wheel, I've tried to find something on the web and on the
>mailing lists. I saw a mention of a contrib that could do this (see
>http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html),
>but I don't see anything applicable.
>
>Has anyone done this yet, if so I would much appreciate some pointers
>(or code), otherwise, I'll be happy to contribute whatever I produce
>(but it might be very simple since I'll only need to deal with french).
>
>Cheers,
>Stephane
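
For what it's worth, here is a minimal sketch of the string-level part of the problem discussed in this thread, kept separate from any Lucene class so it can be tried on its own. It makes some assumptions: it relies on java.text.Normalizer, which only exists in JDKs newer than the ones current when this thread was written (Java 6 and later), and the class and method names (AccentStripper, stripAccents, stripTrailingS, sameIgnoringAccents) are made up for illustration and are not part of Lucene. The Collator method only compares two strings at PRIMARY strength; it does not convert anything, which matches Eric's observation above.

```java
import java.text.Collator;
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper for illustration only; not a Lucene class.
final class AccentStripper {

    // Decompose accented characters into base letter + combining marks (NFD),
    // then delete the combining marks (Unicode general category M).
    static String stripAccents(String word) {
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Same rule as the trailing-'s' snippet earlier in the thread.
    static String stripTrailingS(String word) {
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }

    // Comparison only: at PRIMARY strength the collator treats accent
    // (and case) differences as insignificant.
    static boolean sameIgnoringAccents(String a, String b) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "\u00e9l\u00e8ves" is "eleves" with an acute and a grave accent.
        String word = "\u00e9l\u00e8ves";
        System.out.println(stripTrailingS(stripAccents(word)));
        System.out.println(sameIgnoringAccents("\u00e9", "e"));
    }
}
```

Inside a TokenFilter, the result of stripAccents(token.termText()) could be wrapped in a new Token exactly as in the trailing-'s' snippet above, so the filter would not need to live in the org.apache.lucene.analysis package.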