lucene-dev mailing list archives

From "Shai Erera" <ser...@gmail.com>
Subject Add Token.copyInto(Token) API
Date Wed, 12 Nov 2008 18:51:19 GMT
Hi,

I was thinking about adding a copyInto method to Token. The only way to
clone a token is by using its clone() or clone(char[], int, int, int, int)
methods. Both do the job, but allocate a Token instance. In 2.4 a
Token constructor may accept a char[] as input (thus saving a char[]
allocation), but it still allocates an instance.

Even though the instance allocation itself is not that expensive, it comes
with additional allocations, such as a String for the type, a Payload, and a
String for the text (although that one will be removed in 3.0).
If an application wishes to keep one instance of Token around and copy
other Tokens into it, it can call various methods to achieve that, like
setTermBuffer, setOffset etc. A copyInto is just a convenience method for
doing that.
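
A rough sketch of what such a method could look like. It uses a stand-in class, SimpleToken, rather than the real Token, so every field and method name below is an assumption made for illustration and not the Lucene API. The point is only that copyInto writes into an existing instance and reuses its char[] when it is large enough.

```java
// Stand-in for Lucene's Token. NOT the real class: all names here are
// hypothetical and chosen only to illustrate the proposed copyInto.
final class SimpleToken {
    private char[] termBuffer = new char[16];
    private int termLength;
    private int startOffset;
    private int endOffset;
    private String type = "word";
    private int positionIncrement = 1;

    // Copies the given characters into this token's own buffer,
    // growing the buffer only when it is too small.
    void setTermBuffer(char[] buffer, int offset, int length) {
        if (termBuffer.length < length) {
            termBuffer = new char[length];
        }
        System.arraycopy(buffer, offset, termBuffer, 0, length);
        termLength = length;
    }

    void setOffsets(int start, int end) {
        startOffset = start;
        endOffset = end;
    }

    void setType(String newType) { type = newType; }

    void setPositionIncrement(int increment) { positionIncrement = increment; }

    String term() { return new String(termBuffer, 0, termLength); }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }
    String type() { return type; }
    int positionIncrement() { return positionIncrement; }

    // Hypothetical copyInto: copies this token's state into 'target'
    // without allocating a new token instance.
    void copyInto(SimpleToken target) {
        target.setTermBuffer(termBuffer, 0, termLength);
        target.setOffsets(startOffset, endOffset);
        target.setType(type);               // shares the String reference
        target.setPositionIncrement(positionIncrement);
    }
}
```

With this shape, an application would create one SimpleToken up front and call source.copyInto(saved) each time it needs a snapshot, instead of calling clone().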

If you wonder about the use case, here it is: I know that it's advised
to reuse the same Token instance in the TokenStream API (basically, make sure
to call next(Token)). But there might be TokenFilters which need to save
a certain occurrence of a token, do some processing and return it later. A
good example is a StemmingFilter. One can think of such a filter returning the
original token in addition to the stemmed token (for example, for the word
"tokens" in English, it will return "tokens" [original] and "token" [stem]).
In that case, the filter has to save the word "tokens" so that it returns
"tokens" first (or the stem, the order does not matter), and the next time its
next(Token) is called, it returns the stem (or the original) before
consuming the next token from the TokenStream.
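
The filter described above could be sketched roughly as follows. This is self-contained illustration code, not Lucene code: ReusableToken, TokenSource, ArraySource and OriginalAndStemFilter are hypothetical names, and the "stemmer" merely drops a trailing 's' as a placeholder for a real one. The filter keeps a single saved token for its whole lifetime and copies into it, so no token is allocated per call.

```java
// Hypothetical stand-ins; none of these are Lucene classes.
final class ReusableToken {
    char[] buffer = new char[16];
    int length;
    int positionIncrement = 1;

    void set(char[] src, int len) {
        if (buffer.length < len) buffer = new char[len];
        System.arraycopy(src, 0, buffer, 0, len);
        length = len;
    }

    // Copies this token's state into 'target', reusing target's buffer.
    void copyInto(ReusableToken target) {
        target.set(buffer, length);
        target.positionIncrement = positionIncrement;
    }

    String term() { return new String(buffer, 0, length); }
}

// Mirrors the next(Token) contract: fill 'reusable', or return null at the end.
interface TokenSource {
    ReusableToken next(ReusableToken reusable);
}

// Simple source over a fixed list of words, for demonstration only.
final class ArraySource implements TokenSource {
    private final String[] words;
    private int index;

    ArraySource(String... words) { this.words = words; }

    public ReusableToken next(ReusableToken reusable) {
        if (index >= words.length) return null;
        char[] chars = words[index++].toCharArray();
        reusable.set(chars, chars.length);
        reusable.positionIncrement = 1;
        return reusable;
    }
}

// Returns each original token, followed by its stem when the stem differs.
final class OriginalAndStemFilter implements TokenSource {
    private final TokenSource input;
    private final ReusableToken saved = new ReusableToken(); // allocated once
    private boolean stemPending;

    OriginalAndStemFilter(TokenSource input) { this.input = input; }

    public ReusableToken next(ReusableToken reusable) {
        if (stemPending) {
            // Emit the saved stem before consuming more input.
            stemPending = false;
            saved.copyInto(reusable);
            reusable.positionIncrement = 0; // assumption: stack stem on the original
            return reusable;
        }
        ReusableToken token = input.next(reusable);
        if (token == null) return null;
        // Placeholder stemmer: strip one trailing 's'.
        if (token.length > 1 && token.buffer[token.length - 1] == 's') {
            token.copyInto(saved);
            saved.length = token.length - 1;
            stemPending = true;
        }
        return token; // original goes out first
    }
}
```

Feeding it the words "tokens" and "run" through a single ReusableToken yields "tokens", then "token", then "run".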

Anyway, I hope it's clear enough, but if not I can elaborate.
If you think a copyInto() is worth the effort, I can quickly create a patch
for it.

Shai
