lucene-java-user mailing list archives

From Paul Taylor
Subject Creating additional tokens from input in a token filter
Date Wed, 02 Nov 2011 16:12:09 GMT
I have a token filter that takes tokens and drops any non-alphanumeric
characters,

i.e. 'this-stuff' becomes 'thisstuff'.

What I actually want it to do is split the one token into multiple
tokens, using the non-alphanumeric characters as word boundaries,

i.e. 'this-stuff' becomes 'this stuff'.

How do I do this?

Thanks, Paul

(You may be wondering why I didn't just filter out these characters at
the tokenizer stage. I had to keep them to solve another problem:
they are needed for 'words' that consist only of non-alphanumeric
characters.)

This is my existing class:

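import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class MusicbrainzTokenizerFilter extends TokenFilter {

    /** Construct filtering <i>in</i>. */
    public MusicbrainzTokenizerFilter(TokenStream in) {
        super(in);
        termAtt = (CharTermAttribute) addAttribute(CharTermAttribute.class);
        typeAtt = (TypeAttribute) addAttribute(TypeAttribute.class);
    }

    private static final String ALPHANUMANDPUNCTUATION = ...; // initializer not shown

    // this filter uses the token type attribute
    private TypeAttribute       typeAtt;
    private CharTermAttribute   termAtt;

    /**
     * Advances to the next token; returns false at end of stream.
     * <p>Drops non-alphanumeric characters from tokens of type
     * ALPHANUMANDPUNCTUATION.
     */
    @Override
    public final boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }

        char[] buffer = termAtt.buffer();
        final int bufferLength = termAtt.length();
        final String type = typeAtt.type();

        // Reference comparison: relies on the tokenizer setting this same String instance.
        if (type == ALPHANUMANDPUNCTUATION) {      // remove non-alphanumeric characters
            int upto = 0;
            for (int i = 0; i < bufferLength; i++) {
                char c = buffer[i];
                // Keep letters and digits; anything else is dropped.
                if (Character.isLetterOrDigit(c)) {
                    buffer[upto++] = c;
                }
            }
            // Shrink the term to the characters that were kept.
            termAtt.setLength(upto);
        }
        return true;
    }
}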
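
To make the splitting rule I am after concrete, here is a minimal sketch in plain Java with no Lucene on the classpath. The class TokenSplitSketch and the method splitOnNonAlphanumerics are hypothetical names made up for illustration; they are not part of Lucene or of my filter. The sketch assumes that runs of non-alphanumeric characters act as boundaries and that a term made up only of such characters should be kept whole.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: it shows only the splitting rule.
class TokenSplitSketch {

    /**
     * Splits a term at runs of non-alphanumeric characters.
     * A term that contains no letters or digits at all is returned
     * unchanged, so punctuation-only 'words' survive.
     */
    static List<String> splitOnNonAlphanumerics(String term) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A boundary character closes the part collected so far.
                parts.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        // Assumption: a punctuation-only term is kept as a single token.
        if (parts.isEmpty() && !term.isEmpty()) {
            parts.add(term);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitOnNonAlphanumerics("this-stuff")); // [this, stuff]
        System.out.println(splitOnNonAlphanumerics("!!!"));        // [!!!]
    }
}
```

Inside a TokenFilter, incrementToken() can only hand out one token per call, so the extra parts would presumably have to be queued and returned on later calls before pulling the next token from the input. captureState() and restoreState() look like the usual way to carry the other attributes over to the queued tokens, but I have not tried that here.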
