lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Thai Analyzer in 3.0.2
Date Wed, 01 Dec 2010 06:47:31 GMT
Only analyzers that have non-final tokenStream/reusableTokenStream methods
have this check. As soon as the Analyzer class itself or both methods are
final, this code block is not needed.

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de

eMail: uwe@thetaphi.de

 

From: DM Smith [mailto:dmsmith555@gmail.com] 
Sent: Wednesday, December 01, 2010 6:46 AM
To: java-dev@lucene.apache.org
Subject: Thai Analyzer in 3.0.2

 

I'm curious about some things in the ThaiAnalyzer.

It has:
  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader)
      throws IOException {
    if (overridesTokenStreamMethod) {
      // LUCENE-1678: force fallback to tokenStream() if we
      // have been subclassed and that subclass overrides
      // tokenStream but not reusableTokenStream
      return tokenStream(fieldName, reader);
    }
    
    SavedStreams streams = (SavedStreams) getPreviousTokenStream();
    if (streams == null) {
      streams = new SavedStreams();
      streams.source = new StandardTokenizer(matchVersion, reader);
      streams.result = new StandardFilter(streams.source);
      streams.result = new ThaiWordFilter(streams.result);
      streams.result = new StopFilter(
          StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
          streams.result,
          StopAnalyzer.ENGLISH_STOP_WORDS_SET);
      setPreviousTokenStream(streams);
    } else {
      streams.source.reset(reader);
      streams.result.reset(); // reset the ThaiWordFilter's state
    }
    return streams.result;
  }


I'm really curious why reusableTokenStream has the block:
    if (overridesTokenStreamMethod) {
      // LUCENE-1678: force fallback to tokenStream() if we
      // have been subclassed and that subclass overrides
      // tokenStream but not reusableTokenStream
      return tokenStream(fieldName, reader);
    }
but almost no other Analyzer in contrib has it (none that I have seen).
Shouldn't it be in all of them?
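
To make the purpose of that check concrete, here is a minimal, self-contained sketch of how a flag like overridesTokenStreamMethod could be computed with reflection and then used for the fallback. All class and method names below (SketchAnalyzer, detectTokenStreamOverride, and so on) are hypothetical stand-ins, not the Lucene API, and the "streams" are plain strings so the example runs without Lucene on the classpath. It also illustrates why a final class or final methods would make the check unnecessary: no subclass could override tokenStream in the first place.

```java
import java.io.Reader;
import java.io.StringReader;
import java.lang.reflect.Method;

// Hypothetical stand-in for an analyzer base class; NOT the Lucene API.
abstract class SketchAnalyzer {
    protected boolean overridesTokenStreamMethod = false;

    // Records whether the runtime class inherits tokenStream() from
    // somewhere other than baseClass, i.e. a subclass has overridden it.
    protected void detectTokenStreamOverride(Class<? extends SketchAnalyzer> baseClass) {
        try {
            Method m = getClass().getMethod("tokenStream", String.class, Reader.class);
            overridesTokenStreamMethod = m.getDeclaringClass() != baseClass;
        } catch (NoSuchMethodException e) {
            overridesTokenStreamMethod = false;
        }
    }

    // A String stands in for the token stream chain in this sketch.
    public abstract String tokenStream(String fieldName, Reader reader);

    public String reusableTokenStream(String fieldName, Reader reader) {
        return tokenStream(fieldName, reader);
    }
}

// Non-final analyzer: subclasses may override tokenStream(), so it guards
// its reusable path with the override check.
class SketchThaiAnalyzer extends SketchAnalyzer {
    SketchThaiAnalyzer() {
        detectTokenStreamOverride(SketchThaiAnalyzer.class);
    }

    @Override
    public String tokenStream(String fieldName, Reader reader) {
        return "thai-chain";
    }

    @Override
    public String reusableTokenStream(String fieldName, Reader reader) {
        if (overridesTokenStreamMethod) {
            // Fall back so the subclass's chain is not silently ignored.
            return tokenStream(fieldName, reader);
        }
        return "thai-chain(reused)";
    }
}

// A subclass that overrides tokenStream() but not reusableTokenStream().
class CustomThaiAnalyzer extends SketchThaiAnalyzer {
    @Override
    public String tokenStream(String fieldName, Reader reader) {
        return "custom-chain";
    }
}

class OverrideDetectionDemo {
    public static void main(String[] args) {
        Reader r = new StringReader("text");
        // prints "thai-chain(reused)"
        System.out.println(new SketchThaiAnalyzer().reusableTokenStream("f", r));
        // prints "custom-chain"
        System.out.println(new CustomThaiAnalyzer().reusableTokenStream("f", r));
    }
}
```

Without the check, CustomThaiAnalyzer.reusableTokenStream would return the parent's cached chain and the subclass's override would never be used.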

And also about:
      streams.source.reset(reader);
      streams.result.reset(); // reset the ThaiWordFilter's state
This resets the whole chain, from the tokenizer at the bottom to the filter at the top.
Most of the other Analyzer implementations just have
        streams.source.reset(reader);

It seems to me that calling only streams.source.reset(reader) presumes that
the chain only needs to be reset at the tokenizer.

The documentation for reset() does not indicate that it should always call
super.reset() or input.reset(), which is necessary for chaining back up to
the tokenizer.
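
The difference between the two reset styles can be sketched as follows. This is an illustration under stated assumptions, not the Lucene classes: SketchStream, SketchTokenizer, SketchFilter and StatefulFilter are hypothetical names, and the sketch assumes that a filter's reset() delegates to its input, which is the chaining behavior discussed above.

```java
// Hypothetical minimal stream types; NOT the Lucene classes.
abstract class SketchStream {
    public void reset() {
        // Nothing to reset by default.
    }
}

class SketchTokenizer extends SketchStream {
    String input;
    int resets = 0;

    SketchTokenizer(String input) {
        this.input = input;
    }

    // Analogous in spirit to re-pointing a tokenizer at a new reader.
    void reset(String newInput) {
        this.input = newInput;
    }

    @Override
    public void reset() {
        resets++;
    }
}

class SketchFilter extends SketchStream {
    protected final SketchStream input;

    SketchFilter(SketchStream input) {
        this.input = input;
    }

    @Override
    public void reset() {
        // Delegate down the chain so every stage gets reset.
        input.reset();
    }
}

// A filter that keeps state between tokens and must clear it on reuse.
class StatefulFilter extends SketchFilter {
    boolean hasPending = false;

    StatefulFilter(SketchStream input) {
        super(input);
    }

    @Override
    public void reset() {
        super.reset();      // keep the chain intact
        hasPending = false; // then clear this filter's own state
    }
}

class ResetChainDemo {
    public static void main(String[] args) {
        SketchTokenizer source = new SketchTokenizer("first document");
        StatefulFilter result = new StatefulFilter(source);
        result.hasPending = true; // simulate leftover state from a prior run

        source.reset("second document");       // source-only reset
        System.out.println(result.hasPending); // prints "true": stale state

        result.reset();                        // reset from the top of the chain
        System.out.println(result.hasPending); // prints "false"
        System.out.println(source.resets);     // prints "1"
    }
}
```

A source-only reset is sufficient only when every filter in the chain is stateless. If a filter overrides reset() without calling super.reset(), the delegation stops at that filter and the stages below it are not reset.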

If we go to a declarative model for an analyzer, I would think that one
would always want to do both.

 

-- DM

