Message-ID: <20021120162333.40641.qmail@web12703.mail.yahoo.com>
Date: Wed, 20 Nov 2002 08:23:33 -0800 (PST)
From: Otis Gospodnetic
Subject: Re: Observations: profiling indexing process
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
In-Reply-To: <20021120074053.H5591@lx.quiotix.com>

I realized soon after I sent the message that this is the case and I
knew somebody would quickly point it out :)  Still, if the effort to
improve a piece is costless, why not do it :)

I changed my code locally to use HashMap.  I actually started with
HashSet, but with Sets one can't do set.get(object) :(

Anyhow, yes, there are bigger things to fix.

Otis

--- Brian Goetz wrote:
> > > > I decided to run a little Lucene app that does some indexing
> > > > under a profiler.  (I used JMP, http://www.khelekore.org/jmp/,
> > > > a rather simple one).
> > > >
> > > > The app uses StandardAnalyzer.  I've noticed that a lot of
> > > > time is spent in StandardTokenizer and various JavaCC-generated
> > > > methods.  I am wondering if anyone tried replacing
> > > > StandardTokenizer.jj with something more efficient?
> > > >
> > > > Also, StopFilter is using a Hashtable to store the list of
> > > > stop words.  Has anyone tried using HashMap instead?
>
> HashMap is certainly a higher-performance choice, so long as the map
> is static for the duration of its lifetime and built in the
> constructor.  Otherwise, you could run afoul of thread-safety issues.
> And HashSet uses less memory.
>
> But the bigger point is one that Doug convinced me of only after I
> went on a mad micro-optimization tear earlier in the project (Sorry,
> Doug, you were right) -- and that is that for the most part,
> tokenization is a very, very small part of the total work done by the
> system.  Tokenization gets done once for each document, whereas the
> document gets merged, searched, and queried many times.  Time spent
> tweaking tokenizers for performance is likely wasted effort; that
> time could probably be much better spent improving the code in much
> more useful ways.
>
> Sure, StandardTokenizer is slow.  But that tokenization effort gets
> spread over the many times the document is searched.  Even if it does
> a 1% better job at tokenizing, that might be worth a 100x increase in
> tokenizing time.
> I think any effort you want to spend tweaking tokenizers would be
> much better spent doing a better job of tokenization and
> preprocessing (stemming, dealing intelligently with non-letters and
> word breaks, format stripping) than on performance tweaks.
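
[A minimal sketch of the stop-word change discussed above, assuming
pre-generics Java of the era.  The class and method names here are
hypothetical and illustrative only, not the actual Lucene StopFilter
source.]

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;

    // Idea from the thread: replace the synchronized Hashtable of stop
    // words with an unsynchronized collection that is built once in the
    // constructor and never modified afterwards, so thread safety is not
    // a concern (Brian's point about the map being "static for the
    // duration of its lifetime").
    public class StopWordSketch {

        private final Set stopWords;

        public StopWordSketch(String[] words) {
            Set set = new HashSet();
            for (int i = 0; i < words.length; i++) {
                set.add(words[i]);
            }
            // Wrapping in an unmodifiable view makes the "built in the
            // constructor, never changed" property explicit.
            stopWords = Collections.unmodifiableSet(set);
        }

        // Where Hashtable-based code would test table.get(word) != null,
        // a Set uses contains() instead -- the set.get(object) call Otis
        // mentions does not exist on java.util.Set.
        public boolean isStopWord(String word) {
            return stopWords.contains(word);
        }
    }

[For a pure membership test like stop-word filtering, contains() on a
HashSet is enough, which is why the missing get() need not force the
switch back to HashMap.]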