lucene-dev mailing list archives

From Brian Goetz <br...@quiotix.com>
Subject Re: Observations: profiling indexing process
Date Wed, 20 Nov 2002 15:40:53 GMT
> > > I decided to run a little Lucene app that does some
> > > indexing under a
> > > profiler. (I used JMP,
> > > http://www.khelekore.org/jmp/, a rather simple
> > > one).
> > > 
> > > The app uses StandardAnalyzer.
> > > I've noticed that a lot of time is spent in
> > > StandardTokenizer and
> > > various JavaCC-generated methods.
> > > I am wondering if anyone tried replacing
> > > StandardTokenizer.jj with
> > > something more efficient?
> > > 
> > > Also, StopFilter is using a Hashtable to store the
> > > list of stop words. 
> > > Has anyone tried using HashMap instead?

HashMap is certainly a higher-performance choice, so long as the map
is built in the constructor and never modified afterwards.  Otherwise,
you could run afoul of thread-safety issues, since HashMap, unlike
Hashtable, is not synchronized.
And HashSet uses less memory.  
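
As a rough illustration of the "built in the constructor and never
changed" idea, here is a minimal sketch in modern Java.  The class name
StopWords and its isStopWord method are hypothetical and are not
Lucene's actual StopFilter API; the sketch only assumes that the stop
words arrive as a String array.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical class for illustration only; not Lucene's StopFilter.
final class StopWords {
    // Populated once in the constructor and never modified afterwards,
    // so concurrent readers need no synchronization.
    private final Set<String> words;

    StopWords(String[] stopWords) {
        // Copy into a private HashSet so that later changes to the
        // caller's array cannot affect this instance.
        Set<String> copy = new HashSet<String>(Arrays.asList(stopWords));
        // The unmodifiable wrapper turns accidental writes into exceptions.
        this.words = Collections.unmodifiableSet(copy);
    }

    boolean isStopWord(String token) {
        return words.contains(token);
    }
}
```

Because the field is final and the set is never written to after
construction, instances of this sketch can be shared between indexing
threads without locking, which is the condition under which dropping
Hashtable's synchronization is safe.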

But the bigger point is one that Doug convinced me of only after I
went on a mad micro-optimization tear earlier in the project (Sorry,
Doug, you were right) -- and that is that for the most part,
tokenization is a very very small part of the total work done by the
system.  Tokenization gets done once for each document, whereas the
document gets merged, searched, and queried many times.  Time spent
tweaking tokenizers for performance is likely wasted effort; that time
could probably be much better spent improving the code in much more
useful ways.
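
The amortization argument can be put into a back-of-the-envelope
formula.  The sketch below is only a toy cost model, not a measurement:
the class name TokenizingShare and the cost units are made up, and it
assumes the one-time tokenizing cost and a per-search cost can be
expressed in the same arbitrary unit.

```java
// Hypothetical toy model for illustration; the inputs are not measured data.
final class TokenizingShare {
    // Fraction of the total work for one document that goes to tokenizing,
    // given a one-time tokenizing cost and a per-search cost that is paid
    // once for each of `searches` searches.
    static double share(double tokenizeCost, double searchCost, long searches) {
        return tokenizeCost / (tokenizeCost + searchCost * searches);
    }
}
```

In this model the tokenizing share shrinks as the number of searches
grows, so the more often a document is searched, the less a faster
tokenizer changes the total.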

Sure, StandardTokenizer is slow.  But that tokenization effort gets
spread over the many times the document is searched.  Even if a
tokenizer does only a 1% better job at tokenizing, that might be worth
a 100x increase in tokenizing time.  I think any effort you want to
spend tweaking tokenizers would be much better spent doing a better
job of tokenization and preprocessing (stemming, dealing intelligently
with non-letters and word breaks, format stripping) than on
performance tweaks.


