lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Thai Analyzer in 3.0.2
Date Wed, 01 Dec 2010 05:46:17 GMT
I'm curious about some things in the ThaiAnalyzer.

It has:
  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException
{
    if (overridesTokenStreamMethod) {
      // LUCENE-1678: force fallback to tokenStream() if we
      // have been subclassed and that subclass overrides
      // tokenStream but not reusableTokenStream
      return tokenStream(fieldName, reader);
    }
    
    SavedStreams streams = (SavedStreams) getPreviousTokenStream();
    if (streams == null) {
      streams = new SavedStreams();
      streams.source = new StandardTokenizer(matchVersion, reader);
      streams.result = new StandardFilter(streams.source);
      streams.result = new ThaiWordFilter(streams.result);
      streams.result = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
                                      streams.result, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
      setPreviousTokenStream(streams);
    } else {
      streams.source.reset(reader);
      streams.result.reset(); // reset the ThaiWordFilter's state
    }
    return streams.result;
  }


I'm really curious why reusableTokenStream has the block:
    if (overridesTokenStreamMethod) {
      // LUCENE-1678: force fallback to tokenStream() if we
      // have been subclassed and that subclass overrides
      // tokenStream but not reusableTokenStream
      return tokenStream(fieldName, reader);
    }
but almost no other Analyzer in contrib has it (none that I have seen, at least). Shouldn't
it be in all of them?
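To make sure I understand the hazard LUCENE-1678 is guarding against, here is a small standalone sketch in plain Java (not Lucene's actual classes; the names and the reflection-based override check are my own illustration). A subclass overrides tokenStream() but not reusableTokenStream(); without the fallback, callers that go through reusableTokenStream() would silently skip the subclass's extra filtering:

```java
// Standalone simulation of the LUCENE-1678 fallback. Token chains are
// faked as Strings so the sketch needs no Lucene dependency.
class BaseAnalyzer {
    public String tokenStream(String field) {
        return "standard";
    }
    public String reusableTokenStream(String field) {
        if (overridesTokenStreamMethod()) {
            // fallback: honor the subclass's tokenStream() override
            return tokenStream(field);
        }
        return "standard(reused)";
    }
    // One way to detect the override: ask which class declares tokenStream().
    private boolean overridesTokenStreamMethod() {
        try {
            return getClass().getMethod("tokenStream", String.class)
                             .getDeclaringClass() != BaseAnalyzer.class;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }
}

// Overrides tokenStream() only -- the case the fallback exists for.
class LowercasingAnalyzer extends BaseAnalyzer {
    @Override
    public String tokenStream(String field) {
        return super.tokenStream(field) + "+lowercase";
    }
}

public class OverrideFallbackDemo {
    public static void main(String[] args) {
        System.out.println(new BaseAnalyzer().reusableTokenStream("body"));
        // -> standard(reused)
        System.out.println(new LowercasingAnalyzer().reusableTokenStream("body"));
        // -> standard+lowercase  (without the fallback: standard(reused))
    }
}
```

If that reading is right, the check does seem like it belongs in every reusable Analyzer, not just this one.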

And also about:
      streams.source.reset(reader);
      streams.result.reset(); // reset the ThaiWordFilter's state
This calls reset on everything from the bottom of the chain to the top.
Most other Analyzer implementations have just:
	streams.source.reset(reader);

It seems to me that calling streams.source.reset(reader) presumes that the chain only needs
to be reset at the tokenizer.

The documentation for reset() does not indicate that it should always call super.reset() or
input.reset(), which is necessary for chaining back up to the tokenizer.

If we go to a declarative model for an analyzer, I would think that one would always want
to do both.
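The pattern I mean could be sketched like this (plain Java, invented class names, no Lucene dependency): a stateful filter's reset() clears its own state and then delegates upstream, so a single reset() call at the top of the chain reaches the tokenizer:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal stand-ins for a tokenizer/filter chain.
abstract class Stream {
    public void reset() { }
}

class Source extends Stream {
    int position = 5;  // pretend we have consumed five tokens
    @Override
    public void reset() { position = 0; }
}

class StatefulFilter extends Stream {
    final Stream input;
    final Deque<String> buffered = new ArrayDeque<>();
    StatefulFilter(Stream input) { this.input = input; }
    @Override
    public void reset() {
        buffered.clear();  // clear this filter's own state...
        input.reset();     // ...then chain back toward the tokenizer
    }
}

public class ResetChainDemo {
    public static void main(String[] args) {
        Source source = new Source();
        StatefulFilter filter = new StatefulFilter(source);
        filter.buffered.push("leftover");
        filter.reset();  // one call resets the whole chain
        System.out.println(filter.buffered.isEmpty() && source.position == 0);
        // -> true
    }
}
```

If every reset() were documented to chain like that, resetting only the top of the chain would be enough, and the tokenizer-only idiom would be safe by construction.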


-- DM