lucene-java-user mailing list archives

From stephane vaucher <vauc...@LUB.UMontreal.CA>
Subject Re: Accentuated characters
Date Thu, 12 Dec 2002 19:41:11 GMT
Fair enough, but "protected" would only allow subclasses to 
access it. Personally, I would rather not have to write a subclass to 
implement my feature. I think the logic behind this is that it's an 
intrinsic property of a Term, thus it should be immutable, as any 
modifications to this object might have important side-effects.
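
The trade-off above can be pictured with a minimal sketch. `SimpleToken` is a hypothetical stand-in written for this illustration, not Lucene's Token class. Its text is final, and `withText` hands back the same instance when nothing changed, so a new object is only allocated for tokens a filter actually modifies.

```java
// Hypothetical stand-in for illustration only; this is not Lucene's Token class.
final class SimpleToken {
    private final String text;
    private final int startOffset;
    private final int endOffset;

    SimpleToken(String text, int startOffset, int endOffset) {
        this.text = text;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    String text() { return text; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }

    // Returns this instance when the text is unchanged; otherwise returns
    // a copy that carries the new text and the original offsets.
    SimpleToken withText(String newText) {
        return text.equals(newText)
                ? this
                : new SimpleToken(newText, startOffset, endOffset);
    }
}
```

With this shape, a filter would call `token.withText(word)` and return the result; the original token is never altered, at the cost of one allocation per modified token.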

Stephane


Eric Isakson wrote:

>That method works too. Putting your token filter in the
>org.apache.lucene.analysis package and replacing:
>
>            if ( !word.equals( token.termText() ) ) {
>                return new Token( word, token.startOffset(),
>                                  token.endOffset(), token.type() );
>            }
>
>with
>
>            token.termText = word;
>            return token;
>
>will make your code operate more efficiently, as you won't be creating a
>bunch of new Token objects that will have to be garbage collected. It would
>be nice if termText were "protected" rather than package scoped.
>
>Eric
>
>-----Original Message-----
>From: stephane vaucher [mailto:vaucher@LUB.UMontreal.CA]
>Sent: Thursday, December 12, 2002 12:12 PM
>To: Lucene Users List
>Subject: Re: Accentuated characters
>
>
>There is no problem with package scopes:
>
>This is how I remove trailing 's' chars:
>
>            String word = token.termText();
>           
>            if(word.endsWith("s")){
>                word = word.substring(0, word.length() - 1);
>            }
>           
>            if ( !word.equals( token.termText() ) ) {
>                return new Token( word, token.startOffset(),
>                                  token.endOffset(), token.type() );
>            }
>
>I'll take a look at how the Collator works to see if I can make a 
>generic (maybe locale specific) string normaliser so I could specify the 
>level of differences.
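
The "level of differences" idea maps onto the strength setting of java.text.Collator. The following is a minimal sketch, assuming the standard JDK collation rules for French; the class and method names are made up for this example. Note that it only compares strings; it does not produce a normalised string.

```java
import java.text.Collator;
import java.util.Locale;

// Illustrative helper; the class and method names are assumptions of this sketch.
final class CollatorStrengthDemo {

    // Builds a French collator that compares at the given strength.
    static Collator frenchCollator(int strength) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        // Decompose accented characters so base letter and accent are weighed separately.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        collator.setStrength(strength);
        return collator;
    }

    // PRIMARY strength looks at base letters only: accents and case are ignored.
    static boolean sameIgnoringAccentsAndCase(String a, String b) {
        return frenchCollator(Collator.PRIMARY).equals(a, b);
    }

    // SECONDARY strength also distinguishes accents, but still ignores case.
    static boolean sameIgnoringCaseOnly(String a, String b) {
        return frenchCollator(Collator.SECONDARY).equals(a, b);
    }
}
```

Choosing the strength per call is what would let a caller specify how strict the comparison should be.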
>
>Stephane
>
>Eric Isakson wrote:
>
>>If you really want to make your own TokenFilter, have a look at org.apache.lucene.analysis.LowerCaseFilter.next()
>>
>>it does:
>> public final Token next() throws java.io.IOException {
>>   Token t = input.next();
>>
>>   if (t == null)
>>     return null;
>>
>>   t.termText = t.termText.toLowerCase();
>>
>>   return t;
>> }
>>
>>The termText member of the Token class is package scoped, so you will have
>>to implement your filter in the org.apache.lucene.analysis package. No
>>worries about encoding, as the termText is already a Java (Unicode) string.
>>You will just have to provide the mechanism to get the accented characters
>>converted to their non-accented equivalents. java.text.Collator has some
>>magic that does this for string comparisons, but I couldn't find any public
>>methods that give you access to convert a string to its non-accented
>>equivalent.
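
For the conversion step itself, one possible sketch on a current JDK uses java.text.Normalizer. That class was added in Java 6, so it was not an option when this thread was written, and the `AccentStripper` name is invented for this example. The idea is to decompose each accented character into a base letter plus combining marks, then drop the marks.

```java
import java.text.Normalizer;

// Illustrative helper; assumes Java 6 or later for java.text.Normalizer.
final class AccentStripper {

    // Returns the text with combining accent marks removed.
    static String stripAccents(String text) {
        // NFD splits a precomposed character such as U+00E9 into
        // the base letter 'e' followed by the combining acute accent U+0301.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches Unicode combining marks; removing them leaves the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Characters without a canonical decomposition, such as the ligature 'œ' or the letter 'ø', pass through unchanged and would need an explicit mapping table.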
>>
>>Eric
>>--
>>Eric D. Isakson        SAS Institute Inc.
>>Application Developer  SAS Campus Drive
>>XML Technologies       Cary, NC 27513
>>(919) 531-3639         http://www.sas.com
>>
>>
>>
>>-----Original Message-----
>>From: stephane vaucher [mailto:vaucher@LUB.UMontreal.CA]
>>Sent: Tuesday, December 10, 2002 2:58 PM
>>To: lucene-user@jakarta.apache.org
>>Subject: Accentuated characters
>>
>>
>>Hello everyone,
>>
>>I wish to implement a TokenFilter that will replace accented 
>>characters, so for example 'é' will become 'e'. As I would rather not 
>>reinvent the wheel, I've tried to find something on the web and on the 
>>mailing lists. I saw a mention of a contrib that could do this (see 
>>http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html), 
>>but I don't see anything applicable.
>>
>>Has anyone done this yet? If so, I would much appreciate some pointers 
>>(or code). Otherwise, I'll be happy to contribute whatever I produce 
>>(but it might be very simple, since I'll only need to deal with French).
>>
>>Cheers,
>>Stephane
>>
>>
>>--
>>To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
>>For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>