lucene-java-user mailing list archives

From Paul Taylor <paul_t...@fastmail.fm>
Subject Re: Creating additional tokens from input in a token filter
Date Thu, 03 Nov 2011 11:35:04 GMT
On 02/11/2011 20:48, Paul Taylor wrote:
> On 02/11/2011 17:15, Uwe Schindler wrote:
>> Hi Paul,
>>
>> There is WordDelimiterFilter which does exactly what you want. In 3.x
>> it's unfortunately only shipped in the Solr JAR file, but in 4.0 it's
>> in the analyzers-common module.
> Okay, so I found it and it looks very interesting, but really overly
> complex for what I want to do, and it doesn't handle my specific case.
> Could anyone possibly give a code example
> of how I create two tokens from one, assume I already know how to 
> split it (I cant work that bit out)
>
I took another look at WordDelimiterFilter and managed to get it to
work. Sweet, thanks very much.

In case it is of interest to others, and because I had to hack
WordDelimiterFilter a little bit, this is my solution.

1. I changed my existing tokenizer to convert control/punctuation chars 
to a '-' rather than dropping them

       if (type == ALPHANUMANDPUNCTUATION) {      // replace non-alphanumerics
             int upto = 0;
             for (int i = 0; i < bufferLength; i++) {
                 char c = buffer[i];
                 if (!Character.isLetterOrDigit(c) )
                 {
                     // Replace control/punctuation chars with '-' to help word delimiter
                     buffer[upto++] = '-';
                 }
                 else {
                     //Normal Char
                     buffer[upto++] = c;
                 }
             }
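
For anyone who wants to try step 1 in isolation, here is a minimal standalone sketch of the same replacement in plain Java, with no Lucene dependency. The class and method names (PunctuationToHyphen, replaceNonAlphanumerics) are made up for illustration and are not part of my tokenizer.

```java
public class PunctuationToHyphen {

    // Hypothetical helper: overwrites every char in buffer[0..length) that is
    // neither a letter nor a digit with '-'. Each char is replaced one for one,
    // so the token length does not change.
    public static void replaceNonAlphanumerics(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            if (!Character.isLetterOrDigit(buffer[i])) {
                buffer[i] = '-';
            }
        }
    }

    public static void main(String[] args) {
        char[] buffer = "AC/DC".toCharArray();
        replaceNonAlphanumerics(buffer, buffer.length);
        System.out.println(new String(buffer)); // prints AC-DC
    }
}
```

Because nothing is dropped any more, the separate upto counter in the fragment above always ends up equal to bufferLength, which is why the sketch can write back in place.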

2. I took a copy of WordDelimiterFilter and WordDelimiterIterator and
modified it slightly so that it only acts on tokens whose type attribute
is ALPHANUMANDPUNCTUATION (I couldn't see any constructor that would let
me set this)

public boolean incrementToken() throws IOException {
     while (true) {
       if (!hasSavedState) {
         // process a new input word
         if (!input.incrementToken()) {
           return false;
         }

         // Use word delimiter just on these tokens; pass all others through unchanged
         if (!MusicbrainzTokenizer.TOKEN_TYPES[MusicbrainzTokenizer.ALPHANUMANDPUNCTUATION]
                 .equals(typeAttribute.type())) {
             return true;
         }
         ...................
}

3. Added my WordDelimiterFilter and set it to generateWordParts only

streams.filteredTokenStream = new WordDelimiterFilter(
        streams.filteredTokenStream,
        WordDelimiterIterator.DEFAULT_WORD_DELIM_TABLE,
        1,      // generateWordParts
        0,      // all remaining options switched off
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        null);
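
To show the combined effect of steps 2 and 3 without needing Lucene on the classpath, here is a rough plain-Java sketch. It only approximates what the filter does: tokens of one particular type are split on '-' into word parts, and all other tokens pass through untouched. The Token class, the type string and the method names are all hypothetical stand-ins, and the real WordDelimiterFilter also maintains offsets and position increments, which this ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TypeGatedSplitter {

    // Hypothetical stand-in for a Lucene token: just its text and its type string.
    public static final class Token {
        final String text;
        final String type;

        public Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    // Assumed type name for illustration; the real value comes from the tokenizer.
    public static final String SPLIT_TYPE = "<ALPHANUMANDPUNCTUATION>";

    // Splits tokens of SPLIT_TYPE on '-' and emits each non-empty part;
    // tokens of any other type are emitted unchanged.
    public static List<String> filter(List<Token> input) {
        List<String> out = new ArrayList<String>();
        for (Token t : input) {
            if (!SPLIT_TYPE.equals(t.type)) {
                out.add(t.text);            // pass through untouched
                continue;
            }
            for (String part : t.text.split("-")) {
                if (!part.isEmpty()) {      // skip gaps left by adjacent hyphens
                    out.add(part);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
                new Token("AC-DC", SPLIT_TYPE),
                new Token("live", "<ALPHANUM>"));
        System.out.println(filter(tokens)); // prints [AC, DC, live]
    }
}
```

Note the type check uses equals() rather than a reference comparison, so it still works if the type string is not the same interned constant.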

Cheers Paul
