lucene-java-user mailing list archives

From Carsten Schnober <>
Subject Apply custom tokenization
Date Tue, 06 Mar 2012 14:40:18 GMT
Dear list,
I have a rather specific issue on which I would very much appreciate
some thoughts before I start the actual implementation. Here is my
task description:
I would like to index corpora that have already been tokenized by an
external tokenizer. This tokenization is stored in an external file,
and it is the one I want to use for the Lucene index as well. For each
document, there is a file that describes each token in the document by
character offsets, e.g. "<token start="0" end="3" />". Leave the XML
format aside; I will write an appropriate XML parser, so we can assume
that the tokenization information is available.
I do not want to do any additional analysis on the input text, i.e. no
stopword filtering etc.; each token that is specified in the external
tokenization is supposed to result in one indexed token.
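For concreteness, the offset file could be parsed with the JDK's built-in DOM parser; the following is a minimal sketch that assumes the token elements are wrapped in a root element such as <tokens> (that wrapper name, and the class and method names, are assumptions for illustration):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class TokenOffsets {

    /** One token's span, given by character offsets into the original text. */
    public static final class Span {
        public final int start, end;
        public Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Parses <token start=".." end=".."/> elements into a list of spans. */
    public static List<Span> parse(String xml) {
        try {
            NodeList nodes = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagName("token");
            List<Span> spans = new ArrayList<Span>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                spans.add(new Span(Integer.parseInt(e.getAttribute("start")),
                                   Integer.parseInt(e.getAttribute("end"))));
            }
            return spans;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Ein Beispielsatz.";
        String xml = "<tokens><token start=\"0\" end=\"3\"/>"
                   + "<token start=\"4\" end=\"16\"/></tokens>";
        // Each span selects one token's surface form from the original text.
        for (Span s : parse(xml)) {
            System.out.println(text.substring(s.start, s.end));
        }
    }
}
```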

My approach would be to implement an Analyzer that reads the external
tokenization information and produces a TokenStream containing all the
tokens, with offsets set according to the external tokenization, i.e.
without a Tokenizer implementation of my own. I am working with Lucene
3.5, which is why one very concrete question at this point is: how
would you implement this using the Attribute interface? Should I still
use Token objects, or can/should I avoid them entirely? The
documentation is quite vague on that point, and so is the "Lucene in
Action (2nd ed.)" textbook.
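To make the question concrete, a TokenStream built on the attribute API (rather than Token objects) might look roughly like the following sketch, using CharTermAttribute and OffsetAttribute as available in Lucene 3.5; the Span class is a hypothetical holder for one token's character offsets, and the class name is made up:

```java
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/**
 * Replays externally provided token offsets as a TokenStream,
 * without running any Tokenizer over the text.
 * Sketch only; Span (start/end character offsets) is a hypothetical class.
 */
public final class ExternalTokenStream extends TokenStream {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final String text;
    private final Iterator<Span> it;

    public ExternalTokenStream(String text, List<Span> spans) {
        this.text = text;
        this.it = spans.iterator();
    }

    @Override
    public boolean incrementToken() {
        if (!it.hasNext()) {
            return false; // no more externally defined tokens
        }
        clearAttributes();
        Span s = it.next();
        // Term text is cut directly out of the original text by offsets.
        termAtt.setEmpty().append(text, s.start, s.end);
        offsetAtt.setOffset(s.start, s.end);
        return true;
    }
}
```

An Analyzer would then return this stream from tokenStream(), having loaded the spans from the external file beforehand.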

The background is that I need to support different tokenizations, so
there will potentially be multiple indexes for one text. Queries will
have to be tokenized by a user-defined tokenizer, and the suitable
index will then be searched. So what are your thoughts on that
approach? Is it the right strategy for the task? Please keep in mind
that reading the tokenization from an external file is a given.

In general, I am afraid that Lucene almost hardwires the analysis
process. Even though it does allow custom tokenizers to be
implemented, it does not seem to be intended that one comes up with a
completely self-made text analysis process, does it?

Thank you very much!

Carsten Schnober
Institut für Deutsche Sprache |
Projekt KorAP -- Korpusanalyseplattform der nächsten Generation
Tel.: +49-(0)621-1581-238

