lucene-general mailing list archives

From "Julio Oliveira" <julio.julioolive...@gmail.com>
Subject Re: Tokenizer Question
Date Tue, 06 Jan 2009 08:54:38 GMT
Do a while loop over a StringTokenizer.
new StringTokenizer(VarToTokenized, " ") returns the tokens of the
string, split on spaces.
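A minimal sketch of that approach in plain Java (the method and class names here are illustrative, not from the thread):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {

    // Collect the space-separated tokens of the input into a list.
    public static List<String> splitOnSpaces(String input) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(input, " ");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitOnSpaces("one for all"));  // prints [one, for, all]
    }
}
```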

jOliveira

On Mon, Jan 5, 2009 at 7:58 PM, Steven A Rowe <sarowe@syr.edu> wrote:

> Hi ayyanar,
>
> I should have mentioned in my previous email that the
> general@lucene.apache.org mailing list has very few subscribers - you'll
> get much better response on the java-user@l.a.o mailing list.
>
> On 01/05/2009 at 3:07 PM, ayyanar wrote:
> > My objective is to retain the whole input as a single token, as
> > KeywordTokenizer does, and also to split the input on whitespace and
> > keep those tokens, as WhitespaceTokenizer does
>
> Right, ShingleFilter won't do this for you.
>
> The following, if used to filter WhitespaceTokenizer's output, is similar
> to what you want (note: untested, and also note that this assumes you're
> using Lucene v2.4.0, and not a recent trunk version, which includes the new
> TokenStream API introduced with LUCENE-1422: <
> https://issues.apache.org/jira/browse/LUCENE-1422>):
>
> -----
>
> /**
>  * Extends CachingTokenFilter to output a space-separated-
>  * concatenated-all-input-stream-terms token, followed by
>  * all of the original input stream tokens.
>  * One for all and (then) all for one!
>  */
> public class ThreeMusketeersFilter extends CachingTokenFilter {
>
>  private boolean concatenatedTokenOutput = false;
>
>  public ThreeMusketeersFilter(TokenStream input) {
>    super(input);
>  }
>
>  public Token next(final Token reusableToken) throws IOException {
>    assert reusableToken != null;
>    if (concatenatedTokenOutput) {
>      return super.next(reusableToken);
>    } else {
>      concatenatedTokenOutput = true;
>      Token firstToken = super.next(reusableToken);
>      if (firstToken == null) {
>        return null;
>      }
>      StringBuffer buffer = new StringBuffer();
>      // Append only the valid portion of the term buffer, which may be
>      // larger than the term itself:
>      buffer.append(firstToken.termBuffer(), 0, firstToken.termLength());
>      int start = firstToken.startOffset();
>      int end = firstToken.endOffset();
>      for (Token nextToken = super.next(reusableToken) ;
>           nextToken != null ;
>           nextToken = super.next(reusableToken)) {
>        end = nextToken.endOffset();
>        buffer.append(' ');  // add a space between terms
>        buffer.append(nextToken.termBuffer(), 0, nextToken.termLength());
>      }
>      reusableToken.clear();
>      reusableToken.resizeTermBuffer(buffer.length());
>      reusableToken.setTermLength(buffer.length());
>      buffer.getChars(0, buffer.length(), reusableToken.termBuffer(), 0);
>      reusableToken.setStartOffset(start);
>      reusableToken.setEndOffset(end);
>      super.reset(); // Rewind input stream to get the individual tokens
>      return reusableToken;
>    }
>  }
>
>  public void reset() throws IOException {
>    super.reset();
>    concatenatedTokenOutput = false;
>  }
> }
>
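To make the filter's intended contract concrete, here is a plain-Java sketch (no Lucene required) of the token sequence the filter above is meant to produce: one space-joined token covering the whole input, followed by the original whitespace-split tokens. The class and method names are illustrative only:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThreeMusketeersDemo {

    // Emit the whole input as one concatenated token ("one for all"),
    // then each whitespace-split token ("all for one").
    public static List<String> oneForAllThenAllForOne(String input) {
        List<String> parts =
            new ArrayList<String>(Arrays.asList(input.trim().split("\\s+")));
        List<String> tokens = new ArrayList<String>();
        tokens.add(String.join(" ", parts));  // the concatenated token first
        tokens.addAll(parts);                 // then the individual tokens
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(oneForAllThenAllForOne("quick brown fox"));
        // prints [quick brown fox, quick, brown, fox]
    }
}
```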



-- 
Regards

Julio Oliveira - Buenos Aires

julio.julioOliveira@gmail.com

http://www.linkedin.com/in/juliomoliveira
