lucene-java-user mailing list archives

From dokondr <doko...@gmail.com>
Subject Re: TokenStream: How to get token text?
Date Tue, 25 Dec 2012 20:17:10 GMT
Hi Steve,
Thanks for your help (I just found your e-mail in the list archive); your
solution works!
A complete working example is below. Before finding your answer, however, I
hacked together a straw-man solution, which is a bad way to solve the problem:

        // Hack out token - bad way!
        String tmp = ts.reflectAsString(false);
        String sameToken = (tmp.split(",")[0]).split("=")[1];
        System.out.println("*** Same token : " + sameToken);

I repeat: this is not the right way to do it, and I include it here just for fun.
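
To make the fragility concrete, here is a minimal plain-Java sketch of the same
split-based parsing. It does not use Lucene at all: the sample strings are
hand-written in a "key=value,key=value" layout and are assumptions for
illustration, not captured reflectAsString() output, and the class and method
names (ReflectHackDemo, hackOutToken) are hypothetical.

```java
// Plain-Java sketch: no Lucene classes are used here.
// The sample strings below are hand-written assumptions, not real Lucene output.
class ReflectHackDemo {

    // Same parsing as the hack: take the first comma-separated field,
    // then take the text after the first '='.
    static String hackOutToken(String reflected) {
        return (reflected.split(",")[0]).split("=")[1];
    }

    public static void main(String[] args) {
        // Works when the term comes first and contains neither ',' nor '='.
        System.out.println(hackOutToken("term=some,startOffset=0,endOffset=4")); // prints "some"

        // A term containing ',' is cut short.
        System.out.println(hackOutToken("term=1,5,startOffset=0,endOffset=3"));  // prints "1"

        // A term containing '=' is cut short as well.
        System.out.println(hackOutToken("term=a=b,startOffset=0,endOffset=3"));  // prints "a"

        // If another attribute happens to come first, the wrong value is returned.
        System.out.println(hackOutToken("startOffset=0,endOffset=4,term=some")); // prints "0"
    }
}
```

Whether such tokens or attribute orderings actually occur depends on the
analyzer in use, which is one more reason to read the term from
CharTermAttribute instead.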

---- Complete working example ----
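    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.ru.RussianAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.Version;

    // Substitute the desired Lucene version.
    Version matchVersion = Version.LUCENE_40;

    // RussianAnalyzer is used here, but any other analyzer works.
    Analyzer analyzer = new RussianAnalyzer(matchVersion);
    TokenStream ts = analyzer.tokenStream("myfield",
        new StringReader("some text goes here"));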
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    // To get token strings we need this:
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);

    try {
      ts.reset(); // Resets this stream to the beginning. (Required)
      while (ts.incrementToken()) {
        // Use AttributeSource.reflectAsString(boolean)
        // for token stream debugging.
        System.out.println("token: " + ts.reflectAsString(true));

        // Right way to get tokens
        String token = termAtt.toString();
        System.out.println("*** Token: " + token);

        // Hack out token - bad way!
        String tmp = ts.reflectAsString(false);
        String sameToken = (tmp.split(",")[0]).split("=")[1];
        System.out.println("*** Same token : " + sameToken);

        System.out.println("token start offset: " + offsetAtt.startOffset());
        System.out.println("token end offset: " + offsetAtt.endOffset());
      }
      // Perform end-of-stream operations, e.g. set the final offset.
      ts.end();
    } finally {
      ts.close(); // Release resources associated with this stream.
      analyzer.close();
    }

> Hi Dima,
>
> The example code you mentioned in your other recent email is pretty close.
>
> The only thing you'd probably want to add is access to the
> CharTermAttribute:
>
> CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>
> and then in the loop over ts.incrementToken(), you can get to the output
> tokens using termAtt.buffer() and termAtt.length(), or if you're going to
> Stringify tokens anyway, termAtt.toString().
>
> Steve
>
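
A note on the buffer()/length() route mentioned above: the array returned by
buffer() can be longer than the current term, so only the first length()
characters are valid. The sketch below illustrates that with plain Java only;
the char[] and the int are hypothetical stand-ins for termAtt.buffer() and
termAtt.length(), and TermBufferDemo and termText are made-up names.

```java
// Plain-Java sketch: a char[] and an int stand in for
// termAtt.buffer() and termAtt.length(); no Lucene classes are used.
class TermBufferDemo {

    // Copies only the valid prefix of the buffer into a String.
    static String termText(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        // Hypothetical reused buffer that still holds leftover characters
        // from an earlier, longer token.
        char[] buffer = {'t', 'e', 'x', 't', 'h', 'e', 'r', 'e'};
        int length = 4; // number of valid characters for the current token

        System.out.println(termText(buffer, length)); // prints "text"
        System.out.println(new String(buffer));       // prints "texthere"
    }
}
```

If a String is needed anyway, termAtt.toString() is the simpler choice, as
Steve says; working on the buffer directly is mainly useful for avoiding a new
String per token.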
