lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Question about special characters
Date Fri, 26 May 2006 22:52:24 GMT

: Thanks for the reply, but I don't know how to make this change in
: ISOLatin1AccentFilter.
: Can you give me some advice on how to do it?

I've never really looked at the internals of ISOLatin1AccentFilter, but
the basic idea is to subclass it with a new TokenFilter that maintains a
one token "buffer" of the token stream. Each time next is called you
either return the token waiting in the buffer, or you pull a new token
from the input, return it as is, and buffer a copy with the accents
stripped at the same position. Since ISOLatin1AccentFilter has a method
called removeAccents, I'm guessing it would look something like
this...

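   import java.io.IOException;
   import org.apache.lucene.analysis.ISOLatin1AccentFilter;
   import org.apache.lucene.analysis.Token;
   import org.apache.lucene.analysis.TokenFilter;
   import org.apache.lucene.analysis.TokenStream;
   // Extends TokenFilter directly: this assumes ISOLatin1AccentFilter declares
   // next() final and exposes removeAccents as a public static method.
   public class YourTokenFilter extends TokenFilter {
     private Token bufToken = null;  // accent-stripped copy waiting to go out
     public YourTokenFilter(TokenStream input) {
       super(input);
     }
     public Token next() throws IOException {
       if (null != bufToken) {
          Token t = bufToken;
          bufToken=null;
          return t;
       }
       Token t = input.next();
       if (null == t) return null;   // end of the stream
       bufToken = new Token(ISOLatin1AccentFilter.removeAccents(t.termText()),
                            t.startOffset(),t.endOffset(),t.type());
       bufToken.setPositionIncrement(0);  // same position as the original
       return t;
     }
   }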


...but I haven't tested that (or ever written a TokenFilter of my own for
that matter).
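
For anyone who wants to see the buffering behaviour without a Lucene
jar on the classpath, here is a rough stand-alone sketch of the same idea.
Everything in it is hypothetical and only for illustration: Tok stands in
for Token, ExpandingStream for the filter, and stripAccents uses
java.text.Normalizer as an approximation of removeAccents (it will not
match the Lucene method for characters such as ligatures). Unlike the
untested filter above, it only buffers a stripped copy when the stripped
form differs from the original, on the assumption that emitting identical
duplicates would inflate term counts.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Lucene-free sketch of the one-token buffer idea; all names are hypothetical.
class AccentExpansionSketch {

    // Minimal stand-in for a Lucene Token: term text plus position increment.
    static final class Tok {
        final String text;
        final int posInc;
        Tok(String text, int posInc) { this.text = text; this.posInc = posInc; }
        @Override public String toString() { return text + "(+" + posInc + ")"; }
    }

    // Approximation of removeAccents: decompose, then drop combining marks.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Returns each term as is, then its stripped form (if different) at increment 0.
    static final class ExpandingStream {
        private final Iterator<String> input;
        private Tok buffered = null;

        ExpandingStream(Iterator<String> input) { this.input = input; }

        Tok next() {
            if (buffered != null) {          // a stripped copy is waiting
                Tok t = buffered;
                buffered = null;
                return t;
            }
            if (!input.hasNext()) return null;   // end of the stream
            String term = input.next();
            String stripped = stripAccents(term);
            if (!stripped.equals(term)) {
                buffered = new Tok(stripped, 0); // same position as the original
            }
            return new Tok(term, 1);
        }
    }

    // Drains the stream and renders every token as text(+increment).
    static List<String> expand(List<String> terms) {
        ExpandingStream s = new ExpandingStream(terms.iterator());
        List<String> out = new ArrayList<>();
        for (Tok t = s.next(); t != null; t = s.next()) {
            out.add(t.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // "pl\u00e0nt\u00efu\u00e7" is the made-up word from the original question.
        System.out.println(expand(Arrays.asList("pl\u00e0nt\u00efu\u00e7", "word")));
    }
}
```

The accented word comes out twice, once as written and once stripped with
a position increment of 0, while "word" comes out only once.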


:
: 2006/5/25, Chris Hostetter <hossman_lucene@fucit.org>:
: >
: >
: > I think I'm missing something here.  The whole point of the
: > ISOLatin1AccentFilter is to replace accented characters with their
: > unaccented equivalent -- it sounds like that's working just fine.  If you
: > want the words in the term vector to contain the accents, why don't you
: > stop using that filter?
: >
: > if the problem is that you need to be able to match on both the accented
: > form and the non accented form, perhaps you should have two fields, or
: > modify the ISOLatin1AccentFilter so it puts both versions of the token in
: > the TokenStream with the same position?
: >
: >
: > : > The problem is special characters like à, ä, ç or ñ (Latin characters)
: > : > in the text.
: > : > Now I use the ISO Latin filter, but the problem comes when I want to
: > : > obtain the most used terms. These terms are stored without ` ´ ^ or
: > : > any other "character attribute".
: > : > For example "plàntïuç" (it isn't a real word) is stored as the term
: > : > "plantiuc".
: > : > How can I get the word "plàntïuç" into the term vector?
: > : >
: > : > Thanks for all replies.
: > : > PS: sorry if this question has been answered elsewhere, but I didn't see it.
: >
: >
: >
: > -Hoss
: >
: >
: > ---------------------------------------------------------------------
: > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: > For additional commands, e-mail: java-user-help@lucene.apache.org
: >
: >
:



-Hoss

