lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Creating additional tokens from input in a token filter
Date Wed, 02 Nov 2011 17:15:33 GMT
Hi Paul,

There is WordDelimiterFilter, which does exactly what you want. In 3.x it's
unfortunately only shipped in the Solr JAR file, but in 4.0 it's in the
analyzers-common module.
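
For readers who only want to see the splitting rule from the question below, here is a minimal plain-Java sketch. It is not Lucene code and not how WordDelimiterFilter is implemented; the class name TokenSplitter and its split method are hypothetical, introduced only for illustration. It assumes that runs of non-alphanumeric characters act as boundaries, and that a token made up solely of such characters is kept whole, as the original poster requires.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Lucene: it shows only the splitting rule.
final class TokenSplitter {

    private TokenSplitter() {}

    /**
     * Splits a token on runs of non-alphanumeric characters.
     * A token containing no letters or digits at all is returned unchanged.
     */
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        int start = -1; // start index of the current alphanumeric run, or -1 if none
        for (int i = 0; i < token.length(); i++) {
            if (Character.isLetterOrDigit(token.charAt(i))) {
                if (start < 0) {
                    start = i; // a new alphanumeric run begins here
                }
            } else if (start >= 0) {
                parts.add(token.substring(start, i)); // a boundary closes the run
                start = -1;
            }
        }
        if (start >= 0) {
            parts.add(token.substring(start)); // trailing run
        }
        if (parts.isEmpty()) {
            // Punctuation-only (or empty) token: keep it as a single token.
            return Collections.singletonList(token);
        }
        return parts;
    }
}
```

In a real TokenFilter, incrementToken() emits one token per call, so the extra parts would have to be buffered and handed out on subsequent calls. That bookkeeping is what the existing filter already takes care of.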

Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Paul Taylor [mailto:paul_t100@fastmail.fm]
> Sent: Wednesday, November 02, 2011 5:12 PM
> To: java-user@lucene.apache.org
> Subject: Creating additional tokens from input in a token filter
> 
> I have a tokenizer filter that takes tokens and then drops any
> non-alphanumeric characters
> 
> i.e. 'this-stuff' becomes 'thisstuff'
> 
> but what I actually want it to do is split the one token into multiple
> tokens, using the non-alphanumeric characters as word boundaries
> 
> i.e. 'this-stuff' becomes 'this stuff'
> 
> How do I do this?
> 
> thanks Paul
> 
> (You may be wondering why I didn't just filter out these characters at the
> tokenizer stage, but I had to keep them in to solve another problem: they
> needed to be kept for 'words' that consisted only of non-alphanumeric
> characters)
> 
> This is my existing class:
> 
> public class MusicbrainzTokenizerFilter extends TokenFilter {
>      /**
>       * Construct filtering <i>in</i>.
>       */
>      public MusicbrainzTokenizerFilter(TokenStream in) {
>          super(in);
>          termAtt = addAttribute(CharTermAttribute.class);
>          typeAtt = addAttribute(TypeAttribute.class);
>      }
> 
>      private static final String ALPHANUMANDPUNCTUATION =
>              MusicbrainzTokenizer.TOKEN_TYPES[MusicbrainzTokenizer.ALPHANUMANDPUNCTUATION];
> 
>      // this filter uses the type attribute
>      private TypeAttribute       typeAtt;
>      private CharTermAttribute   termAtt;
> 
>      /**
>       * Advances to the next token in the stream; returns false at EOS.
>       * <p>Removes non-alphanumeric characters from tokens of type
>       * ALPHANUMANDPUNCTUATION.
>       */
>      public final boolean incrementToken() throws java.io.IOException {
>          if (!input.incrementToken()) {
>              return false;
>          }
> 
>          char[] buffer = termAtt.buffer();
>          final int bufferLength = termAtt.length();
>          final String type = typeAtt.type();
> 
>          // remove non-alphanumerics
>          if (ALPHANUMANDPUNCTUATION.equals(type)) {
>              int upto = 0;
>              for (int i = 0; i < bufferLength; i++) {
>                  char c = buffer[i];
>                  if (Character.isLetterOrDigit(c)) {
>                      buffer[upto++] = c;   // keep; any other character is dropped
>                  }
>              }
>              termAtt.setLength(upto);
>          }
>          return true;
>      }
> }
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

