lucene-dev mailing list archives

From: Brian Goetz
Subject: Re: Should Token be immutable?
Date: Thu, 12 Dec 2002 20:37:37 GMT

> I suppose org.apache.lucene.analysis.LowerCaseFilter and
> PorterStemFilter modify the Token termText property as an
> optimization. Their next() method will be called once for each token
> for each filter in the chain of filters during analysis. Creating a
> new Token for every modification could create a _lot_ of objects to
> be garbage collected.

Well, in the worst case, it would multiply by 1.5 the number of
objects created by the tokenization process, since the Token is mostly
a wrapper for the term text (a String), and the term text needs to be
created regardless (two objects -- the String object and its
underlying char array.)  And if the term text is put together with
StringBuffer operations (explicitly or implicitly), that's more
objects.  But the most likely case is considerably better; there are
lots of other intermediate objects created as a result of
tokenization, too.  I'm guessing that you're talking about no more
than a 15% increase in garbage creation during this particular part of
the process.  And JVMs have gotten a LOT smarter about dealing with
temporary intermediate objects.
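
To make the allocation cost concrete, here is a minimal sketch of an
immutable token and a lowercasing step that returns a new token instead
of modifying the old one.  ImmutableToken, withTermText and
LowerCaseSketch are hypothetical names made up for illustration, and the
token is cut down to term text plus offsets; this is not Lucene's actual
Token or LowerCaseFilter code.

```java
import java.util.Locale;

// Hypothetical, simplified stand-in for a token; not Lucene's Token class.
final class ImmutableToken {
    private final String termText;
    private final int startOffset;
    private final int endOffset;

    ImmutableToken(String termText, int startOffset, int endOffset) {
        this.termText = termText;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    String termText() { return termText; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }

    // Returns a new token carrying different text; this instance is unchanged.
    ImmutableToken withTermText(String newText) {
        return new ImmutableToken(newText, startOffset, endOffset);
    }
}

// Hypothetical filter step, written as a plain static method for brevity.
final class LowerCaseSketch {
    // Allocates one extra ImmutableToken per call. The lowercased String
    // comes from toLowerCase whether or not the token itself is mutable.
    static ImmutableToken lowerCase(ImmutableToken in) {
        return in.withTermText(in.termText().toLowerCase(Locale.ROOT));
    }
}
```

The only allocation the immutable version adds per filter step is the
small wrapper object; the term text is produced the same way in both
designs.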

Immutability has a lot of advantages.  True, it sometimes means a
performance hit.  But are you sure we've really got one here?  First,
the time spent in tokenization is a small percentage of Lucene's
overall work, as searches are generally more frequent than additions
(otherwise we'd just use grep.)  Second, does anyone feel that
Lucene's tokenization is too slow?  I sure don't.  

Unless someone can demonstrate a real performance problem, I think
we're better off doing it "right" -- make Token immutable.
