lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Noll <dan...@nuix.com>
Subject Extending StandardAnalyzer considered harmful
Date Thu, 04 Jun 2009 04:58:10 GMT
Hi all.
I just want to tell some people an interesting story. :-)

We had a custom analyser which was implemented like this:

    public class NoStopWordsAnalyser extends StandardAnalyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream result = new StandardTokenizer(reader);
            result = new StandardFilter(result);
            result = new LowerCaseFilter(result);
            return result;
        }
    }

Now, this seemed all well and good, and we unit tested the analyser and it
worked.

Some time later, a newer version of Lucene added reusableTokenStream() for a
performance optimisation.  The indexing side was updated to call the new
reusableTokenStream() method.  Since we had extended StandardAnalyzer, it
ended up going to StandardAnalyzer's implementation, which puts on a
StopFilter, so our NoStopWordsAnalyser ended up including a StopFilter.
 This in itself would have been a minor issue as well, but the QueryParser
side still called the older tokenStream() method, which resulted in
different token streams being used for indexing and querying -- even though
it was the same analyser!

The end result is that we could never get a hit if the query included any
stop words.

The solution in this case was to extend Analyzer instead.  I just thought it
was interesting that we got affected so much by a new method to the API
which didn't cause a compilation error or raise any other flags which would
indicate we had done something wrong.  Our own unit tests still passed
because they were testing the older method.  Had StandardAnalyzer or its
tokenStream() method been made final during this change, we would have had
our compilation fail.  Had it been final from the very first time the class
was introduced, it would have prevented the problem in its entirety as we
would have realised much sooner that it wasn't safe to override in the
beginning.

Daniel



-- 
Daniel Noll                            Forensic and eDiscovery Software
Senior Developer                              The world's most advanced
Nuix                                                email data analysis
http://nuix.com/                                and eDiscovery software

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message