lucene-java-user mailing list archives

From Paul Taylor <paul_t...@fastmail.fm>
Subject Creating additional tokens from input in a token filter
Date Wed, 02 Nov 2011 16:12:09 GMT
I have a token filter that takes tokens and drops any non-alphanumeric 
characters,

i.e. 'this-stuff' becomes 'thisstuff'.

What I actually want it to do is split the one token into multiple 
tokens, using the non-alphanumeric characters as word boundaries,

i.e. 'this-stuff' becomes 'this stuff'.

How do I do this?

Thanks, Paul

(You may be wondering why I didn't just filter out these characters at 
the tokenizer stage. I had to keep them to solve another problem: they 
must be preserved for 'words' that consist only of non-alphanumeric 
characters.)

This is my existing class:

public class MusicbrainzTokenizerFilter extends TokenFilter {
     /**
      * Construct filtering <i>in</i>.
      */
     public MusicbrainzTokenizerFilter(TokenStream in) {
         super(in);
         termAtt = addAttribute(CharTermAttribute.class);
         typeAtt = addAttribute(TypeAttribute.class);
     }

     private static final String ALPHANUMANDPUNCTUATION =
             MusicbrainzTokenizer.TOKEN_TYPES[MusicbrainzTokenizer.ALPHANUMANDPUNCTUATION];

     // this filter uses the type attribute to decide which tokens to modify
     private TypeAttribute       typeAtt;
     private CharTermAttribute   termAtt;

     /**
      * Advances to the next token in the stream; returns false at EOS.
      * <p>For tokens of type ALPHANUMANDPUNCTUATION, removes every
      * non-alphanumeric character from the term.
      */
     public final boolean incrementToken() throws java.io.IOException {
         if (!input.incrementToken()) {
             return false;
         }

         char[] buffer = termAtt.buffer();
         final int bufferLength = termAtt.length();
         final String type = typeAtt.type();

         // Remove non-alphanumerics by compacting the term buffer in place.
         // equals() is used so the check does not rely on string identity.
         if (ALPHANUMANDPUNCTUATION.equals(type)) {
             int upto = 0;
             for (int i = 0; i < bufferLength; i++) {
                 char c = buffer[i];
                 if (Character.isLetterOrDigit(c)) {
                     buffer[upto++] = c;   // keep letters and digits
                 }
                 // anything else is dropped
             }
             termAtt.setLength(upto);
         }
         return true;
     }
}
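
For reference, the splitting behaviour asked about above can be sketched 
in plain Java, without any Lucene dependency. This is only an 
illustration of the intended result, not a working TokenFilter: the 
class name AlnumSplitSketch and the method split are hypothetical, and 
keeping a term whole when it contains no letters or digits is an 
assumption based on the note about 'words' made up only of 
non-alphanumeric characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or of the existing filter.
public class AlnumSplitSketch {

    /**
     * Splits the first `length` chars of `buffer` into runs of letters and
     * digits, treating every other character as a boundary. A term with no
     * letters or digits at all is returned unchanged (assumed requirement).
     */
    public static List<String> split(char[] buffer, int length) {
        List<String> parts = new ArrayList<>();
        int start = -1;                       // start of the current run, or -1
        for (int i = 0; i < length; i++) {
            if (Character.isLetterOrDigit(buffer[i])) {
                if (start < 0) {
                    start = i;                // a new run begins here
                }
            } else if (start >= 0) {
                parts.add(new String(buffer, start, i - start));
                start = -1;                   // boundary ends the run
            }
        }
        if (start >= 0) {                     // run that reaches the end
            parts.add(new String(buffer, start, length - start));
        }
        if (parts.isEmpty() && length > 0) {  // punctuation-only term: keep it
            parts.add(new String(buffer, 0, length));
        }
        return parts;
    }

    public static void main(String[] args) {
        char[] term = "this-stuff".toCharArray();
        System.out.println(split(term, term.length)); // prints [this, stuff]
    }
}
```

Inside a TokenFilter, one possible way to use this would be to hold the 
extra parts in a queue and hand out one per incrementToken() call, 
saving the attributes of the original token with captureState() and 
reapplying them with restoreState() before setting each part as the 
term. Offsets and position increments would also need a decision, which 
this sketch does not attempt.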
