From "Mike O'Leary"
Subject Lucene 4.0 PerFieldAnalyzerWrapper question
Date Tue, 25 Sep 2012 23:57:05 GMT
I am updating an analyzer that uses a particular configuration of the PerFieldAnalyzerWrapper
to work with Lucene 4.0. A few of the fields use a custom analyzer and StandardTokenizer and
the other fields use the KeywordAnalyzer and KeywordTokenizer. The older version of the analyzer
looks like this:

public class MyPerFieldAnalyzer extends Analyzer {
  PerFieldAnalyzerWrapper _analyzer;

  public MyPerFieldAnalyzer() {
    Map<String, Analyzer> analyzerMap = new HashMap<String, Analyzer>();

    analyzerMap.put("IDNumber", new KeywordAnalyzer());

    _analyzer = new PerFieldAnalyzerWrapper(new CustomAnalyzer(), analyzerMap);
  }

  public TokenStream tokenStream(String fieldname, Reader reader) {
    TokenStream stream = _analyzer.tokenStream(fieldname, reader);
    return stream;
  }
}

In older versions of Lucene it is necessary to define a tokenStream function, but in 4.0 it
is not (in fact, tokenStream is declared final, so you can't). Instead, it is necessary to
define a createComponents function that takes the same arguments as the tokenStream function
and returns a TokenStreamComponents object. The TokenStreamComponents constructor has a Tokenizer
argument and a TokenStream argument. I assume I can just use the same code to provide the
TokenStream object as was used in the older analyzer's tokenStream function, but I don't see
how to provide a Tokenizer object, unless it is by creating a separate map of field names
to Tokenizers that works the same way the analyzer map does. Is that the best way to do this,
or is there a better way? For example, would it be better to inherit from AnalyzerWrapper
instead of from Analyzer? In that case I would need to define getWrappedAnalyzer and wrapComponents
functions. I think in that case I would still need to put the same kind of logic in the wrapComponents
function that specifies which tokenizer to use with which field, though. It looks like the
PerFieldAnalyzerWrapper itself assumes that the same tokenizer will be used with all fields,
as its wrapComponents function ignores the fieldname parameter. I would appreciate any help
in finding out the best way to update this analyzer and to write the required function(s).
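
To make the question concrete, here is a minimal sketch of the per-field delegation idea in plain Java. It deliberately uses stand-in types: Components, SimpleAnalyzer, PerFieldDelegator, and the "body" field name are all hypothetical and are not Lucene classes. It only illustrates one possible design: if the wrapper picks a delegate analyzer per field and hands back that delegate's components unchanged, the tokenizer travels with the delegate, and a separate map of field names to tokenizers would not be needed. Whether the real AnalyzerWrapper in 4.0 can be used this way is the open question.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in types only: these model the shape of the problem, not the Lucene API.
final class PerFieldSketch {

    /** Hypothetical stand-in for a (tokenizer, token stream) pair. */
    static final class Components {
        final String tokenizerName;
        final String streamName;

        Components(String tokenizerName, String streamName) {
            this.tokenizerName = tokenizerName;
            this.streamName = streamName;
        }
    }

    /** Hypothetical stand-in for an analyzer that builds components for a field. */
    interface SimpleAnalyzer {
        Components createComponents(String fieldName);
    }

    /** Picks a delegate per field and returns that delegate's components unchanged. */
    static final class PerFieldDelegator implements SimpleAnalyzer {
        private final SimpleAnalyzer defaultAnalyzer;
        private final Map<String, SimpleAnalyzer> perField;

        PerFieldDelegator(SimpleAnalyzer defaultAnalyzer,
                          Map<String, SimpleAnalyzer> perField) {
            this.defaultAnalyzer = defaultAnalyzer;
            this.perField = new HashMap<>(perField); // defensive copy
        }

        /** Field-specific analyzer if one is registered, otherwise the default. */
        SimpleAnalyzer getWrappedAnalyzer(String fieldName) {
            SimpleAnalyzer analyzer = perField.get(fieldName);
            return analyzer != null ? analyzer : defaultAnalyzer;
        }

        @Override
        public Components createComponents(String fieldName) {
            // The tokenizer comes along with the chosen delegate's components,
            // so there is no second map from field names to tokenizers.
            return getWrappedAnalyzer(fieldName).createComponents(fieldName);
        }
    }

    /** Mirrors the configuration above: keyword handling for IDNumber only. */
    static PerFieldDelegator example() {
        Map<String, SimpleAnalyzer> analyzerMap = new HashMap<>();
        analyzerMap.put("IDNumber",
                field -> new Components("KeywordTokenizer", "KeywordTokenizer"));
        SimpleAnalyzer custom =
                field -> new Components("StandardTokenizer", "CustomFilterChain");
        return new PerFieldDelegator(custom, analyzerMap);
    }
}
```

In this sketch, PerFieldSketch.example().createComponents("IDNumber") yields components built by the keyword stand-in, while any other field (for example the hypothetical "body") falls through to the custom stand-in.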

