lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "MOYSE Gilles (Cetelem)" <>
Subject RE: multiple tokens from a single input token
Date Mon, 10 Nov 2003 15:01:58 GMT

I experienced the same problem, and I used the following solution (maybe not
the good one, but it works, and not too slowly).
The problem was to detect synonyms. I used a synonyms file, made up of that
kind of lines :
	a b c
	d e f

to define a, b, and c as synonyms, and d, e and f as ohter synonyms.
So, when my filter received a token 'b' for instance, I wanted it to return
three tokens, 'a', 'b' and 'c'.
I used a FIFO stack to solve that.
When the filter receives a token, it checks whether the stack is empty or
not. If it is, then it returns the received token. If it is not empty, then
it returns the poped (i.e. the first which was pushed. It's better to use a
FIFO stack to keep a correct order) value from the stack.
When you receive the 'null' token, indicating the end of stream, then you
continue returning the poped values from yoour stack until it is empty. Then
you return 'null'.

Hope it will help.

Gilles Moyse

-----Message d'origine-----
De : Peter Keegan []
Envoyé : lundi 10 novembre 2003 15:43
À : Lucene user's list
Objet : re: multiple tokens from a single input token

I would appreciate some clarification on how to generate multiple tokens
from a single input token.

In a previous message: (see:,
Pierrick Brihaye provides the following code:

public final Token next() throws IOException {
while (true) {
String emittedText; 
int positionIncrement = 0; 
//New token ?
if (receivedText.length() == 0) {
receivedToken =; 
if (receivedToken == null) return null;
positionIncrement = 1; 
emittedText = getNextPart(); 
//Warning : all tokens are emitted with the *same* offset
if (emittedText.length() > 0) {
Token emittedToken = new Token(emittedText, receivedToken.startOffset(),
return emittedToken;
I assume that you would extend the TokenFilter class and override the 'next'
method. But what I don't understand is how you return more than one Token
(with different settings for 'setpositionIncrement') if the 'next' method is
only called once for each input token.

For example, when my custom filter's 'next()' method receives token 'A' from
'DocumentWriter.invertDocument()', it wants to return token 'A' and token
'B' at the same postion. How is this done? It seems I can only return one
token at a time from 'next()'. I think I'm missing something obvious :-(


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message