lucene-dev mailing list archives

From "Shai Erera"
Subject Re: Add Token.copyInto(Token) API
Date Thu, 13 Nov 2008 15:14:57 GMT
Thanks. I am aware of this thread. Indeed, it will change the way
TokenStreams are handled, and so copying a Token may no longer be necessary.
However, I can't tell yet whether it will still be needed. I guess I'll
just have to wait until it's out and I start using it :-)

Anyway, I've implemented it for myself, and thought this might be a nice
contribution. I can live without it in Lucene :-)


On Wed, Nov 12, 2008 at 10:02 PM, Grant Ingersoll wrote:

> Are you aware of LUCENE-1422?  There is likely going to be a new way of
> dealing w/ TokenStreams altogether, so you might want to have a look there
> before continuing.
> On Nov 12, 2008, at 1:51 PM, Shai Erera wrote:
>> Hi,
>> I was thinking about adding a copyInto method to Token. The only way to
>> clone a token is by using its clone() or clone(char[], int, int, int, int)
>> methods. Both do the job, but allocate a Token instance. In 2.4 a
>> Token constructor can take a char[] as input (thus saving a char[]
>> allocation), but it still allocates an instance.
>> Even though the instance allocation is not that expensive, it does
>> allocate additional things, like String for the type, Payload and String
>> (for the text, even though that will be removed in 3.0).
>> If an application wishes to keep one instance of Token around, and copy
>> into it other Tokens, it can call various methods to achieve that, like
>> setTermBuffer, setOffset etc. A copyInto is just a convenience method for
>> doing that.
>> If you wonder about the use case, then here it is: I know that it's
>> advised to reuse the same Token instance in the TokenStream API (basically,
>> make sure to call next(Token)). But there might be TokenFilters which will
>> need to save a certain occurrence of a token, do some processing and return
>> it later. A good example is StemmingFilter. One can imagine such a filter
>> returning the original token in addition to the stemmed token (for example,
>> for the word "tokens" in English, it will return "tokens" [original] and
>> "token" [stem]). In that case, the filter has to save the word "tokens" so
>> that it returns "tokens" first (or the stem, the order does not matter) and
>> next time its next(Token) is called, it should return the stem (or
>> original), before consuming the next token from the TokenStream.
>> Anyway, I hope it's clear enough, but if not I can elaborate.
>> If you think a copyInto() is worth the effort, I can quickly create a
>> patch for it.
>> Shai
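
For readers following the thread, the copyInto idea and the original-plus-stem
use case described in the quoted message can be sketched roughly as below. This
is only an illustration under stated assumptions, not the patch that was
offered and not Lucene's API: SimpleToken is a hypothetical stand-in for Token,
OriginalAndStemFilter and CopyIntoDemo are invented names, the input is a plain
Iterator<String> rather than a TokenStream, and the "stemmer" just drops a
trailing "s".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for Lucene's Token; not the real class. */
class SimpleToken {
    char[] termBuffer = new char[16];
    int termLength;
    int startOffset;
    int endOffset;
    String type = "word";

    /** Replaces the term text, growing the buffer only when it is too small. */
    void setTerm(String text) {
        if (text.length() > termBuffer.length) {
            termBuffer = new char[text.length()];
        }
        text.getChars(0, text.length(), termBuffer, 0);
        termLength = text.length();
    }

    String term() {
        return new String(termBuffer, 0, termLength);
    }

    /** Copies this token's state into target without allocating a new token. */
    void copyInto(SimpleToken target) {
        if (target.termBuffer.length < termLength) {
            target.termBuffer = new char[termLength];
        }
        System.arraycopy(termBuffer, 0, target.termBuffer, 0, termLength);
        target.termLength = termLength;
        target.startOffset = startOffset;
        target.endOffset = endOffset;
        target.type = type;
    }
}

/** Returns each input word and, when the placeholder stemmer applies, its stem on the next call. */
class OriginalAndStemFilter {
    private final Iterator<String> input;
    private final SimpleToken saved = new SimpleToken(); // one instance kept for the filter's lifetime
    private boolean stemPending;
    private int offset;

    OriginalAndStemFilter(Iterator<String> input) {
        this.input = input;
    }

    /** Fills the caller's reusable token; returns null when the input is exhausted. */
    SimpleToken next(SimpleToken result) {
        if (stemPending) {
            stemPending = false;
            saved.copyInto(result);   // restore the saved original, offsets included
            result.termLength -= 1;   // placeholder stemmer (assumption): drop the trailing "s"
            result.type = "stem";
            return result;
        }
        if (!input.hasNext()) {
            return null;
        }
        String word = input.next();
        result.setTerm(word);
        result.startOffset = offset;
        result.endOffset = offset + word.length();
        result.type = "word";
        offset = result.endOffset + 1;
        if (word.length() > 1 && word.endsWith("s")) {
            result.copyInto(saved);   // the caller may overwrite result before the next call
            stemPending = true;
        }
        return result;
    }
}

class CopyIntoDemo {
    /** Drains the filter with a single reusable token and records "term/type" pairs. */
    static List<String> run(List<String> words) {
        OriginalAndStemFilter filter = new OriginalAndStemFilter(words.iterator());
        SimpleToken reusable = new SimpleToken();
        List<String> out = new ArrayList<String>();
        while (filter.next(reusable) != null) {
            out.add(reusable.term() + "/" + reusable.type);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [tokens/word, token/stem, are/word, copied/word]
        System.out.println(run(Arrays.asList("tokens", "are", "copied")));
    }
}
```

In this sketch the filter allocates one saved token up front and reuses it for
every word, so no token instance is created per call. The same effect is
available by calling the individual setters the message mentions; copyInto only
bundles them into one call.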