From stephane vaucher <vauc...@LUB.UMontreal.CA>
Subject Re: Should Token be immutable?
Date Fri, 13 Dec 2002 19:31:23 GMT
Hi Brian,

I've skimmed through the analyser's code, but I'm in no way a 
Lucene expert. If I understood your statement correctly, you are saying 
that we would multiply the number of objects by 1.5 per tokeniser the 
analyser uses. A potential "optimisation" is that the String could 
sometimes be reused, since it is immutable as well.
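
To make the reuse idea concrete, here is a rough sketch. The class and 
method names (ImmutableTokenSketch, Token, withTermText, lowerCase) are 
made up for illustration and are not Lucene's actual API; it only 
assumes a token is a wrapper around the term text plus its offsets. A 
filter step hands back the same instance whenever the text does not 
change, so a new object is only created when something really differs.

```java
import java.util.Locale;

public class ImmutableTokenSketch {

    /** Hypothetical immutable token; NOT Lucene's actual Token class. */
    static final class Token {
        private final String termText;
        private final int startOffset;
        private final int endOffset;

        Token(String termText, int startOffset, int endOffset) {
            this.termText = termText;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        String termText()  { return termText; }
        int startOffset()  { return startOffset; }
        int endOffset()    { return endOffset; }

        /** Returns this same instance when the text is unchanged, so no new object is created. */
        Token withTermText(String newText) {
            return newText.equals(termText)
                    ? this
                    : new Token(newText, startOffset, endOffset);
        }
    }

    /** Hypothetical filter step: lower-cases a token without mutating it. */
    static Token lowerCase(Token in) {
        return in.withTermText(in.termText().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Token already = new Token("lucene", 0, 6);
        Token mixed = new Token("Lucene", 7, 13);

        // Text was already lower case: the very same Token comes back.
        System.out.println(lowerCase(already) == already);

        // Text changed: a new Token is created, the original is untouched.
        System.out.println(lowerCase(mixed).termText() + " / " + mixed.termText());
    }
}
```

Whether this saves anything measurable would depend on how often a 
filter actually changes the text, so treat it as an illustration 
rather than a benchmark.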

Personally, I believe it would be cleaner to make Token immutable (I 
think that's why this thread started), so +1.


Brian Goetz wrote:

>>I suppose org.apache.lucene.analysis.LowerCaseFilter and
>>PorterStemFilter modify the Token termText property as an
>>optimization. Their next() method will be called once for each token
>>for each filter in the chain of filters during analysis. Creating a
>>new Token for every modification could create a _lot_ of objects to
>>be garbage collected.
>
>Well, in the worst case, it would multiply by 1.5 the number of
>objects created by the tokenization process, since the Token is mostly
>a wrapper for the term text (a String), and the term text needs to be
>created regardless (two objects -- the String object and its
>underlying char array.)  And if the term text is put together with
>StringBuffer operations (explicitly or implicitly), that's more
>objects.  But the most likely case is considerably better; there are
>lots of other intermediate objects created as a result of
>tokenization, too.  I'm guessing that you're talking about no more
>than a 15% increase in garbage creation during this particular part of
>the process.  And JVMs have gotten a LOT smarter about dealing with
>temporary intermediate objects.
>
>Immutability has a lot of advantages.  True, it sometimes means a
>performance hit.  But are you sure we've really got one here?  First,
>the time spent in tokenization is a small percentage of Lucene's
>overall work, as searches are generally more frequent than additions
>(otherwise we'd just use grep.)  Second, does anyone feel that
>Lucene's tokenization is too slow?  I sure don't.  
>
>Unless someone can demonstrate a real performance problem, I think
>we're better off doing it "right" -- make Token immutable.
