lucene-dev mailing list archives

From eks dev <eks...@yahoo.co.uk>
Subject Re: Token termBuffer issues
Date Sat, 21 Jul 2007 16:31:26 GMT
As a side note, maybe a good place to mention it: the clever people on the mg4j project solved this polymorphic need for char[]/String (fast modification, compactness, fast hashing, safety) with something called MutableString. The javadoc there is worth reading.
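
Roughly the idea, as a minimal self-contained sketch (this is not the actual mg4j API, just an illustration of why one mutable char[] holder with a cached hash beats bouncing between String and char[]):

    // Illustrative only -- see the mg4j MutableString javadoc for the real thing.
    final class ReusableText implements CharSequence {
      private char[] buf = new char[16];
      private int len;
      private int hash;                    // 0 = stale; recomputed lazily after mutation

      ReusableText set(char[] src, int off, int n) {
        if (n > buf.length) buf = new char[Math.max(n, 2 * buf.length)];
        System.arraycopy(src, off, buf, 0, n);
        len = n;
        hash = 0;                          // invalidate the cached hash
        return this;
      }

      public int length() { return len; }
      public char charAt(int i) { return buf[i]; }
      public CharSequence subSequence(int s, int e) { return new String(buf, s, e - s); }
      public String toString() { return new String(buf, 0, len); }

      public int hashCode() {              // same recipe as String.hashCode()
        int h = hash;
        if (h == 0) {
          for (int i = 0; i < len; i++) h = 31 * h + buf[i];
          hash = h;
        }
        return h;
      }
    }

The point is that modification, hashing and reuse all happen on the same backing array, with no intermediate String allocations.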



----- Original Message ----
From: eks dev <eksdev@yahoo.co.uk>
To: java-dev@lucene.apache.org
Sent: Saturday, 21 July, 2007 6:23:25 PM
Subject: Re: Token termBuffer issues


On 7/21/07, Michael McCandless <lucene@mikemccandless.com> wrote:
>> To further improve "out of the box" performance I would really also
>> like to fix the core analyzers, when possible, to re-use a single
>> Token instance for each Token they produce.  This would then mean no
>> objects are created as you step through Tokens in the TokenStream
>> ... so this would give the best performance.

>How much better I wonder?  Small object allocation & reclaiming is
>supposed to be very good in current JVMs.


Sorry, I cannot give you exact numbers now, but I know for sure that we decided to move the "real analysis" into a separate phase that gets executed before entering the Lucene TokenStream and indexing, due to the String in Token, and then do just simple whitespace tokenisation during indexing. And this was not done just for fun; there was some real benefit in it.

The performance issue here is in making transformations on tokens during analysis (where this applies). You gave a very nice example, stemming, which itself generates new Strings; another nice example is NGram generation in the SpellChecker, which generates rather large numbers of small objects.
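
To make the NGram point concrete: a String-based gram loop allocates one object per gram, while a char[]-based one can hand out slices of a single buffer. A hypothetical sketch (the names here are made up for illustration, this is not the SpellChecker code):

    // Emit n-grams as (buffer, offset, length) slices instead of one String per gram.
    interface GramConsumer {
      void accept(char[] buf, int off, int len);
    }

    static void ngrams(char[] word, int wordLen, int n, GramConsumer out) {
      for (int i = 0; i + n <= wordLen; i++)
        out.accept(word, i, n);            // zero allocations per gram
    }

versus the String version, which does new String(word, i, n) on every iteration.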

The simplest model, tokenize (without modifying) / index, ironically also benefits from char[], as then things go really fast in general, so every new String() on the way gets noticed in the profiler. While testing the new indexing code from Mike, we also changed our vanilla Tokenizer to use the termBuffer, and there was again something like a 10-15% boost.
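
Roughly what that change amounted to, sketched against the char[] Token API under discussion (I am writing method names like termBuffer()/resizeTermBuffer()/setTermLength() from memory of the patch, so treat them as assumptions; input is a Reader and offset a position counter on the tokenizer):

    // Illustrative sketch of a termBuffer-based whitespace tokenizer that
    // reuses the caller's Token instead of allocating a String per token.
    public Token next(Token token) throws IOException {
      token.clear();
      int len = 0;
      int start = offset;
      char[] buf = token.termBuffer();
      int c;
      while ((c = input.read()) != -1) {
        offset++;
        if (Character.isWhitespace((char) c)) {
          if (len > 0) break;              // token complete
          start = offset;                  // skip leading whitespace
          continue;
        }
        if (len == buf.length) buf = token.resizeTermBuffer(1 + len);
        buf[len++] = (char) c;
      }
      if (len == 0) return null;           // end of stream
      token.setTermLength(len);
      token.setStartOffset(start);
      token.setEndOffset(start + len);
      return token;                        // same instance, reused every call
    }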

It's been a while since then, so I do not know the exact numbers, but I have learned this many times the hard way: nothing beats char[] when it comes to text processing.

To stop bothering you people: IMHO there is hard work to be done in the Analyzer chain before Token gets ready for prime time in Lucene core, and that is the place where String overproduction hurts.
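
For what it's worth, filters are where the reuse pays off most, because a filter can work in place on the buffer and never touch String at all. Another hedged sketch against the same assumed API (termBuffer()/termLength() as above, input being the upstream TokenStream):

    // In-place lowercasing filter: mutates the reused buffer, allocates nothing.
    public Token next(Token token) throws IOException {
      token = input.next(token);
      if (token == null) return null;
      char[] buf = token.termBuffer();
      int len = token.termLength();
      for (int i = 0; i < len; i++)
        buf[i] = Character.toLowerCase(buf[i]);
      return token;
    }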









---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org