lucene-dev mailing list archives

From Paul Jakubik <p...@purediscovery.com>
Subject ReusableAnalyzerBase bug in 3.4.0?
Date Tue, 01 Nov 2011 22:12:16 GMT
Hi,

I think I found a bug in ReusableAnalyzerBase, but am also wondering if I'm
simply missing something. Let me describe what I am seeing, and maybe you
can point out where I'm making bad assumptions.

By using the ReusableAnalyzerBase you can create a single shared analyzer,
and it contains code to make the interesting parts of your analyzer thread
local.

Part of making this work is putting all of the interesting components
inside of ReusableAnalyzerBase.TokenStreamComponents.

When you call ReusableAnalyzerBase.reusableTokenStream, it checks if it has
a thread local TokenStreamComponents, and if so it calls
TokenStreamComponents.reset(Reader), which resets the token source. This
method does not reset the TokenStream sink held in TokenStreamComponents.

Because of this, if any of the filters in the TokenStream are stateful, you
have to recreate them instead of resetting them and using them again. So if
you use a filter like LimitTokenCountFilter or ShingleFilter, you have to
recreate it, even though these filters have reset methods that could be
called.
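
To make the concern concrete, here is a minimal, self-contained sketch of
the behavior I am describing. It does not use Lucene at all: WordSource,
LimitFilter, and Components are hypothetical stand-ins for a Tokenizer, a
stateful filter such as LimitTokenCountFilter, and TokenStreamComponents.
The resetSink flag is also my own invention, there only to compare the two
reset strategies side by side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified model only; none of these types are Lucene classes.
class ReuseSketch {

    /** Token source that can be pointed at new input (stand-in for a Tokenizer). */
    static final class WordSource {
        private Iterator<String> words = new ArrayList<String>().iterator();

        void reset(String text) {
            words = Arrays.asList(text.trim().split("\\s+")).iterator();
        }

        String next() {
            return words.hasNext() ? words.next() : null;
        }
    }

    /** Stateful filter: passes at most `limit` tokens until reset() is called. */
    static final class LimitFilter {
        private final WordSource input;
        private final int limit;
        private int seen = 0;

        LimitFilter(WordSource input, int limit) {
            this.input = input;
            this.limit = limit;
        }

        String next() {
            if (seen >= limit) {
                return null; // limit reached, behave as end of stream
            }
            String word = input.next();
            if (word != null) {
                seen++;
            }
            return word;
        }

        void reset() {
            seen = 0;
        }
    }

    /** Reusable source/sink pair (stand-in for TokenStreamComponents). */
    static final class Components {
        final WordSource source = new WordSource();
        final LimitFilter sink;
        private final boolean resetSink; // hypothetical switch for this comparison

        Components(int limit, boolean resetSink) {
            this.sink = new LimitFilter(source, limit);
            this.resetSink = resetSink;
        }

        void reset(String text) {
            source.reset(text);
            if (resetSink) {
                sink.reset(); // the extra call under discussion
            }
        }
    }

    /** Resets the components for new text, then drains the sink. */
    static List<String> analyze(Components components, String text) {
        components.reset(text);
        List<String> out = new ArrayList<String>();
        for (String w = components.sink.next(); w != null; w = components.sink.next()) {
            out.add(w);
        }
        return out;
    }

    public static void main(String[] args) {
        Components sourceOnly = new Components(3, false);
        System.out.println(analyze(sourceOnly, "a b c d")); // [a, b, c]
        System.out.println(analyze(sourceOnly, "e f g h")); // [] (counter still at the limit)

        Components sourceAndSink = new Components(3, true);
        System.out.println(analyze(sourceAndSink, "a b c d")); // [a, b, c]
        System.out.println(analyze(sourceAndSink, "e f g h")); // [e, f, g]
    }
}
```

In this model, the reused filter only behaves correctly on the second text
when something resets it. Whether that reset belongs in
TokenStreamComponents.reset or somewhere else is exactly what I am unsure
about.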

Am I missing important reasons why TokenStreamComponents.reset is
implemented as:
    protected boolean reset(final Reader reader) throws IOException {
      source.reset(reader);
      return true;
    }

instead of
    protected boolean reset(final Reader reader) throws IOException {
      source.reset(reader);
      sink.reset();  // proposed addition
      return true;
    }

If there is a good reason to avoid resetting the sink here, then would it
help other people to better document that implementations of
ReusableAnalyzerBase.createComponents should not create stateful components?

Paul
