lucene-dev mailing list archives

From Endre Stølsvik <>
Subject Re: new Token API
Date Wed, 21 Nov 2007 12:05:54 GMT
Yonik Seeley wrote:
> On Nov 19, 2007 7:02 PM, Doug Cutting <> wrote:
>> Yonik Seeley wrote:
>>> 1) If we are deprecating some methods like String termText(), how
>>> about at the same time deprecating "String type"?  If we want
>>> lightweight per-token metadata for communication between filters, an
>>> int or a long used as a bitvector (32 or 64 independent boolean vars
>>> per token) would be much more useful than a single String.
>> There are tokenizers that use the type string, e.g., StandardFilter &
>> similar things in Nutch.  How would you replace such uses?  Add a bit
>> for each token type?  Is that really that much more useful?
> It is, given that it enables a token to have more than one type at once.
> The benefit is probably relatively minor (the number of people who
> would use it), and I wouldn't have brought it up except that it could
> piggy-back on the other recent changes to Token.

I'm just a lurker!

However, I'll chime in and say that this sounds interesting. But please 
use a long if you do such a thing: it is better to have some extra bits 
available for the future, and given that most future Lucene 
installations will run on 64-bit systems, a long shouldn't have a 
performance impact.

You could however use a String[], or use Set<String>, to communicate (or 
potentially use "comma-separated values" in the one String, but this 
makes uniquely identifying your particular token somewhat messy). Will 
the restriction of 32 core bits and 32 user bits ever be a problem? What 
about completely different usages, like categorizing something into an 
indefinite number of bins? (Just to be the devil's advocate..)
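
To make the long-as-bitvector idea concrete, here is a minimal sketch 
in Java. Everything in it is hypothetical: the class name TokenFlags, 
the flag names, and the split into 32 "core" bits (low half) and 32 
"user" bits (high half) are assumptions for illustration only, not 
part of any existing or proposed Lucene API.

```java
// Hypothetical sketch: none of these names exist in Lucene's Token API.
final class TokenFlags {
    // Low 32 bits: imagined "core" flags, one bit per token type.
    static final long ALPHANUM = 1L << 0;
    static final long ACRONYM  = 1L << 1;
    static final long HOST     = 1L << 2;

    private TokenFlags() {}

    // High 32 bits: imagined "user" flags, addressed by index 0..31.
    static long userFlag(int n) {
        if (n < 0 || n >= 32) {
            throw new IllegalArgumentException("user flag index must be 0..31");
        }
        return 1L << (32 + n);
    }

    // Returns flags with every bit of mask switched on.
    static long set(long flags, long mask) {
        return flags | mask;
    }

    // Returns flags with every bit of mask switched off.
    static long clear(long flags, long mask) {
        return flags & ~mask;
    }

    // True only if every bit of mask is set in flags.
    static boolean has(long flags, long mask) {
        return (flags & mask) == mask;
    }
}
```

A token can carry several flags at once, e.g. 
TokenFlags.set(TokenFlags.ACRONYM, TokenFlags.userFlag(0)), which a 
single type String cannot express. The cost is the hard cap of 64 
flags; a Set<String> has no such cap but allocates per token.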

Michael mentioned setting some reference to null, with the result 
being that GC kicked in more often. If this is the case for that 
particular scenario, then please don't optimize along those lines. 
Getting rid of your never-to-be-used-again objects as fast as possible 
is _always_ good, and if in some strange situation the opposite seems 
true, then that will probably change radically in the next iteration of 
GC development, or for example by setting the huge bunch of GC 
selection and tuning parameters correctly .. or something..

With that said, obviously reusing the char[] is the better way to go: 
not creating an object at all is of course better than dropping an 
object and then recreating the same thing moments afterwards.
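
A rough sketch of what reusing the char[] could look like, in Java. 
The class ReusableTermBuffer and its methods are hypothetical names 
made up for this illustration; this is not how Token is or will be 
implemented, just the general pattern of growing a buffer only when 
needed and otherwise overwriting it in place.

```java
// Hypothetical sketch of buffer reuse; not Lucene's actual Token class.
final class ReusableTermBuffer {
    private char[] buffer = new char[16];
    private int length = 0;

    // Copies text into the existing buffer, allocating a new array
    // only when the current one is too small.
    void setTerm(CharSequence text) {
        int n = text.length();
        if (n > buffer.length) {
            buffer = new char[Math.max(n, buffer.length * 2)];
        }
        for (int i = 0; i < n; i++) {
            buffer[i] = text.charAt(i);
        }
        length = n;
    }

    // The backing array; only the first length() chars are valid.
    char[] buffer() {
        return buffer;
    }

    int length() {
        return length;
    }

    // Convenience accessor; note that this one does allocate a String.
    String termText() {
        return new String(buffer, 0, length);
    }
}
```

Calling setTerm("lucene") and then setTerm("token") on the same 
instance leaves the same char[] in place, so no garbage is produced 
per token as long as terms fit in the buffer.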

Have you run your profilers on this question? Seems like a prudent thing 
to do if you're in a situation where some API will change anyway.

Thanks for reading my ramblings,
