lucene-java-user mailing list archives

From Paul Taylor <paul_t...@fastmail.fm>
Subject Re: Creating additional tokens from input in a token filter
Date Wed, 02 Nov 2011 20:48:47 GMT
On 02/11/2011 17:15, Uwe Schindler wrote:
> Hi Paul,
>
> There is WordDelimiterFilter which does exactly what you want. In 3.x it's
> unfortunately only shipped in the Solr JAR file, but in 4.0 it's in the
> analyzers-common module.
Okay, so I found it and it looks very interesting, but it is really overly
complex for what I want to do and doesn't handle my specific case. Could
anyone possibly give a code example of how I create two tokens from one?
Assume I already know how to split it (I can't work that bit out)

     public final boolean incrementToken() throws java.io.IOException {
         if (!input.incrementToken()) {
             return false;
         }

         char[] buffer = termAtt.buffer();
         final int bufferLength = termAtt.length();
         final String type = typeAtt.type();

         if (ALPHANUMANDPUNCTUATION.equals(type)) {
             int upto = 0;

             for (int i = 0; i < bufferLength; i++) {
                 char c = buffer[i];
                 if (!Character.isLetterOrDigit(c) )
                 {
                     //TODO PUT ALL CHARS AFTER THIS INTO A NEW TOKEN
                 }
                 else {
                     buffer[upto++] = c;
                 }
             }
             termAtt.setLength(upto);
         }
         return true;
     }
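
One possible shape for the TODO above, as a minimal sketch rather than a
tested Lucene filter: keep a queue of pending parts, and on each call emit a
queued part before pulling the next token from the input. To stay runnable
on its own, the sketch below uses no Lucene classes. SplittingFilterSketch,
next(), split() and drain() are hypothetical names; next() stands in for
incrementToken() and returns null where incrementToken() would return false.
Passing pure-punctuation tokens through unchanged is an assumption based on
the requirement quoted below.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Stand-alone model (not a Lucene class) of a filter that emits several tokens per input token. */
class SplittingFilterSketch {
    private final Iterator<String> input;                      // stands in for the wrapped TokenStream
    private final Deque<String> pending = new ArrayDeque<>();  // parts split off but not yet emitted

    SplittingFilterSketch(Iterator<String> input) {
        this.input = input;
    }

    /** Mirrors incrementToken(): returns the next token, or null at end of stream. */
    String next() {
        // Emit parts left over from the previous input token first.
        if (!pending.isEmpty()) {
            return pending.poll();
        }
        if (!input.hasNext()) {
            return null;
        }
        String term = input.next();
        List<String> parts = split(term);
        if (parts.isEmpty()) {
            // Assumption: a token with no letters or digits is passed through unchanged.
            return term;
        }
        // Return the first part now and queue the rest for later calls.
        pending.addAll(parts.subList(1, parts.size()));
        return parts.get(0);
    }

    /** Splits a term into maximal runs of letters and digits. */
    static List<String> split(String term) {
        List<String> parts = new ArrayList<>();
        int start = -1; // start index of the current run, or -1 if not inside a run
        for (int i = 0; i < term.length(); i++) {
            if (Character.isLetterOrDigit(term.charAt(i))) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                parts.add(term.substring(start, i));
                start = -1;
            }
        }
        if (start >= 0) {
            parts.add(term.substring(start));
        }
        return parts;
    }

    /** Helper for the demo: collects every token the filter produces. */
    static List<String> drain(SplittingFilterSketch filter) {
        List<String> out = new ArrayList<>();
        for (String t = filter.next(); t != null; t = filter.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<String> tokens = Arrays.asList("this-stuff", "abc", "!!!").iterator();
        System.out.println(drain(new SplittingFilterSketch(tokens))); // prints [this, stuff, abc, !!!]
    }
}
```

In a real TokenFilter the same structure would probably need captureState()
after reading the input token and restoreState() before emitting each queued
part, then setting the term text on termAtt and adjusting offsets and the
position increment. That part is untested here and may differ between 3.x
and 4.0.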



>> -----Original Message-----
>> From: Paul Taylor [mailto:paul_t100@fastmail.fm]
>> Sent: Wednesday, November 02, 2011 5:12 PM
>> To: java-user@lucene.apache.org
>> Subject: Creating additional tokens from input in a token filter
>>
>> I have a tokenizer filter that takes tokens and then drops any
>> non-alphanumeric characters
>>
>> i.e. 'this-stuff' becomes 'thisstuff'
>>
>> but what I actually want it to do is split the one token into multiple
>> tokens, using the non-alphanumeric characters as word boundaries
>>
>> i.e. 'this-stuff' becomes 'this stuff'
>>
>> How do I do this?
>>
>> thanks Paul
>>
>> (You may be wondering why I just didn't filter out these characters at the
>> tokenizer stage, but I had to keep them in to solve another problem, that
>> is they needed to be kept for 'words' that only consisted of
>> non-alphanumeric characters)
>>
>> This is my existing class:
>>
>> public class MusicbrainzTokenizerFilter extends TokenFilter {
>>       /**
>>        * Construct filtering<i>in</i>.
>>        */
>>       public MusicbrainzTokenizerFilter(TokenStream in) {
>>           super(in);
>>           termAtt = (CharTermAttribute) addAttribute(CharTermAttribute.class);
>>           typeAtt = (TypeAttribute) addAttribute(TypeAttribute.class);
>>       }
>>
>>       private static final String ALPHANUMANDPUNCTUATION
>>               = MusicbrainzTokenizer.TOKEN_TYPES[MusicbrainzTokenizer.ALPHANUMANDPUNCTUATION];
>>
>>       // this filters uses attribute type
>>       private TypeAttribute       typeAtt;
>>       private CharTermAttribute   termAtt;
>>
>>       /**
>>        * Returns the next token in the stream, or null at EOS.
>>        *<p>Removes<tt>'</tt>  from the words.
>>        *<p>Removes dots from acronyms.
>>        */
>>       public final boolean incrementToken() throws java.io.IOException {
>>           if (!input.incrementToken()) {
>>               return false;
>>           }
>>
>>           char[] buffer = termAtt.buffer();
>>           final int bufferLength = termAtt.length();
>>           final String type = typeAtt.type();
>>
>>           if (type == ALPHANUMANDPUNCTUATION) {      // remove non-alphanumerics
>>               int upto = 0;
>>               for (int i = 0; i < bufferLength; i++) {
>>                   char c = buffer[i];
>>                   if (!Character.isLetterOrDigit(c) )
>>                   {
>>                       //Do Nothing, (drop the character)
>>                   }
>>                   else {
>>                       buffer[upto++] = c;
>>                   }
>>               }
>>               termAtt.setLength(upto);
>>           }
>>           return true;
>>       }
>> }
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>

