lucene-java-user mailing list archives

From Carsten Schnober
Subject TokenStreamComponents in Lucene 4.0
Date Mon, 19 Nov 2012 16:44:38 GMT
I recently updated to Lucene 4.0, but I am having problems with my
custom Analyzer/Tokenizer.

With Lucene 3.6, indexing worked like this:

0. define constants lucene_version and indexdir
1. create an Analyzer: analyzer = new KoraAnalyzer() (our custom Analyzer)
2. create an IndexWriterConfig: config = new
IndexWriterConfig(lucene_version, analyzer)
3. create an IndexWriter: writer = new IndexWriter(indexdir, config)
4. for each document:
4.1. create a Document: Document doc = new Document()
4.2. create a Field: Field field = new Field("text", layerFile,
Field.Store.YES, Field.Index.ANALYZED_NO_NORMS,
4.3. add field to document: doc.add(field)
4.4. add document to writer: writer.addDocument(doc)
5. close the writer (write to disk)

However, after switching to Lucene 4 and TokenStreamComponents, I'm
seeing strange behaviour: only the first document in the collection
is tokenized properly. The others do appear in the index, but
un-tokenized, although I have tried not to change anything in the logic.
The Analyzer now has this createComponents() method calling the custom
TokenStreamComponents class with my custom Tokenizer:

protected TokenStreamComponents createComponents(String fieldName,
    Reader reader) {
  final Tokenizer source = new KoraTokenizer(reader);
  final TokenStreamComponents tokenstream = new
      KoraTokenStreamComponents(source);
  try {
    // [...]
  } catch (IOException e) {
    // [...]
  }
  return tokenstream;
}

The custom TokenStreamComponents class uses this constructor:

public KoraTokenStreamComponents(Tokenizer tokenizer) {
  try {
    // [...]
  } catch (IOException e) {
    // TODO Auto-generated catch block
  }
}

Since I have not changed anything in the Tokenizer, I suspect the error
to be in the new class KoraTokenStreamComponents. This may be due to the
fact that I do not fully understand why the TokenStreamComponents class
has been introduced.
Any hints on that? Thanks!
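
One possible explanation, stated here only as an assumption and not verified against KoraTokenizer: Lucene 4.0 keeps the TokenStreamComponents returned by createComponents() and, for later documents, passes the new Reader to the existing components through setReader() instead of creating them again. A tokenizer that consumes its input once at construction time would then only ever see the first document. The sketch below models that pattern in plain Java. All class and method names in it (ReuseSketch, SketchTokenizer, SketchAnalyzer, run) are hypothetical stand-ins and not Lucene classes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of component reuse; none of these are Lucene classes.
public class ReuseSketch {

    /** Stand-in tokenizer that splits its input on whitespace. */
    static class SketchTokenizer {
        private final boolean eager; // true: read the input once, in the constructor
        private Reader input;
        private List<String> tokens = new ArrayList<>();
        private int pos;

        SketchTokenizer(Reader input, boolean eager) throws IOException {
            this.input = input;
            this.eager = eager;
            if (eager) {
                tokens = readAll(input);
            }
        }

        /** Called when the cached instance is handed the next document. */
        void setReader(Reader input) {
            this.input = input;
        }

        /** Called before each pass; the lazy variant reads the current input here. */
        void reset() throws IOException {
            if (!eager) {
                tokens = readAll(input);
            }
            pos = 0;
        }

        /** Returns the next token, or null when the stream is exhausted. */
        String next() {
            return pos < tokens.size() ? tokens.get(pos++) : null;
        }

        private static List<String> readAll(Reader r) throws IOException {
            StringBuilder sb = new StringBuilder();
            for (int c = r.read(); c != -1; c = r.read()) {
                sb.append((char) c);
            }
            String text = sb.toString().trim();
            return text.isEmpty() ? new ArrayList<String>() : Arrays.asList(text.split("\\s+"));
        }
    }

    /** Stand-in analyzer that creates its tokenizer once and reuses it afterwards. */
    static class SketchAnalyzer {
        private final boolean eager;
        private SketchTokenizer cached;

        SketchAnalyzer(boolean eager) {
            this.eager = eager;
        }

        List<String> analyze(String text) throws IOException {
            Reader reader = new StringReader(text);
            if (cached == null) {
                cached = new SketchTokenizer(reader, eager); // first document only
            } else {
                cached.setReader(reader);                    // every later document
            }
            cached.reset();
            List<String> out = new ArrayList<>();
            for (String t = cached.next(); t != null; t = cached.next()) {
                out.add(t);
            }
            return out;
        }
    }

    /** Analyzes each document in turn with one shared analyzer. */
    static List<List<String>> run(boolean eager, String... docs) {
        SketchAnalyzer analyzer = new SketchAnalyzer(eager);
        List<List<String>> result = new ArrayList<>();
        try {
            for (String doc : docs) {
                result.add(analyzer.analyze(doc));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("eager: " + run(true, "a b", "c d"));
        System.out.println("lazy:  " + run(false, "a b", "c d"));
    }
}
```

In this model the eager variant returns the first document's tokens for every document, while the lazy variant tokenizes each one. The symptom in the real index may look different, but if the cause is similar, moving any input-dependent setup of the Tokenizer into reset() would be the first thing to try.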

Institut für Deutsche Sprache |
Projekt KorAP                 |
Tel. +49-(0)621-43740789      |
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
