lucene-general mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: Tokenizer Question
Date Mon, 05 Jan 2009 22:58:53 GMT
Hi ayyanar,

I should have mentioned in my previous email that the general@lucene.apache.org mailing list
has very few subscribers - you'll get a much better response on the java-user@l.a.o mailing
list.

On 01/05/2009 at 3:07 PM, ayyanar wrote:
> My objective is to retain the keyword (input stream) as is a token like
> a keyword tokenizer does and also split the keyword by whitespace and
> maintain that tokens as a white space tokenizer does

Right, ShingleFilter won't do this for you.

The following filter, if applied to WhitespaceTokenizer's output, does something similar to
what you want. Note that it is untested, and that it assumes you're using Lucene v2.4.0 rather
than a recent trunk version, which includes the new TokenStream API introduced with
LUCENE-1422: <https://issues.apache.org/jira/browse/LUCENE-1422>

-----

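import java.io.IOException;

import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

/**
 * Extends CachingTokenFilter to output a single token that
 * concatenates all input stream terms, separated by spaces,
 * followed by all of the original input stream tokens.
 * One for all and (then) all for one!
 */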
public class ThreeMusketeersFilter extends CachingTokenFilter {

  private boolean concatenatedTokenOutput = false;

  public ThreeMusketeersFilter(TokenStream input) {
    super(input);
  }
  
  public Token next(final Token reusableToken) throws IOException {
    assert reusableToken != null;
    if (concatenatedTokenOutput) {
      // Concatenated token already emitted: replay the cached original tokens
      return super.next(reusableToken);
    } else {
      concatenatedTokenOutput = true;
      Token firstToken = super.next(reusableToken);
      if (firstToken == null) {
        return null; // empty input stream
      }
      StringBuffer buffer = new StringBuffer();
      // termBuffer() can be longer than the term; append only termLength() chars
      buffer.append(firstToken.termBuffer(), 0, firstToken.termLength());
      int start = firstToken.startOffset();
      int end = firstToken.endOffset();
      for (Token nextToken = super.next(reusableToken) ;
           nextToken != null ;
           nextToken = super.next(reusableToken)) {
        end = nextToken.endOffset();
        buffer.append(' ');  // add a space between terms
        buffer.append(nextToken.termBuffer(), 0, nextToken.termLength());
      }
      // Copy the concatenated text into the reusable token
      reusableToken.clear();
      reusableToken.resizeTermBuffer(buffer.length());
      reusableToken.setTermLength(buffer.length());
      buffer.getChars(0, buffer.length(), reusableToken.termBuffer(), 0);
      // The concatenated token spans from the first token's start to the last token's end
      reusableToken.setStartOffset(start);
      reusableToken.setEndOffset(end);
      super.reset(); // Rewind input stream to get the individual tokens
      return reusableToken;
    }
  }
  
  public void reset() throws IOException {
    super.reset();
    concatenatedTokenOutput = false;
  }
}
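
-----

To make the intended output order concrete, here is a small Lucene-free sketch of the term sequence the filter above is meant to produce. ExpectedTokenOrder and expectedTerms are hypothetical names made up for this illustration and are not part of Lucene. The sketch models term text only (no offsets or position increments), and it assumes the input is split on runs of whitespace, roughly as WhitespaceTokenizer would do:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: models only the term text
// that the filter is intended to emit, in emission order.
class ExpectedTokenOrder {

  static List<String> expectedTerms(String input) {
    List<String> terms = new ArrayList<String>();
    // Split on runs of whitespace (an approximation of WhitespaceTokenizer)
    for (String term : input.trim().split("\\s+")) {
      if (term.length() > 0) {
        terms.add(term);
      }
    }
    List<String> output = new ArrayList<String>();
    if (terms.isEmpty()) {
      return output; // empty input: nothing is emitted
    }
    // First the single concatenated token, terms joined by one space...
    StringBuilder joined = new StringBuilder();
    for (int i = 0; i < terms.size(); i++) {
      if (i > 0) {
        joined.append(' ');
      }
      joined.append(terms.get(i));
    }
    output.add(joined.toString());
    // ...then each original token, in input order
    output.addAll(terms);
    return output;
  }

  public static void main(String[] args) {
    // prints [new york city, new, york, city]
    System.out.println(expectedTerms("new york city"));
  }
}
```

Because the terms are rejoined with a single space, the concatenated token matches the original input exactly only when the input has no leading, trailing, or repeated whitespace. That is one way in which the result is similar to, but not the same as, KeywordTokenizer's output.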
