Message-ID: <20021120162333.40641.qmail@web12703.mail.yahoo.com>
Date: Wed, 20 Nov 2002 08:23:33 -0800 (PST)
From: Otis Gospodnetic
Subject: Re: Observations: profiling indexing process
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
In-Reply-To: <20021120074053.H5591@lx.quiotix.com>

I realized soon after I sent the message that this is the case and I
knew somebody would quickly point it out :)  Still, if the effort to
improve a piece is costless, why not do it :)

I changed my code locally to use HashMap.  I actually started with
HashSet, but with Sets one can't do set.get(object) :(

Anyhow, yes, there are bigger things to fix.

Otis

--- Brian Goetz wrote:
> > > > I decided to run a little Lucene app that does some indexing
> > > > under a profiler.  (I used JMP, http://www.khelekore.org/jmp/,
> > > > a rather simple one).
> > > >
> > > > The app uses StandardAnalyzer.  I've noticed that a lot of
> > > > time is spent in StandardTokenizer and various JavaCC-generated
> > > > methods.  I am wondering if anyone tried replacing
> > > > StandardTokenizer.jj with something more efficient?
> > > >
> > > > Also, StopFilter is using a Hashtable to store the list of
> > > > stop words.  Has anyone tried using HashMap instead?
>
> HashMap is certainly a higher-performance choice, so long as the map
> is static for the duration of its lifetime and built in the
> constructor.  Otherwise, you could run afoul of thread-safety issues.
> And HashSet uses less memory.
>
> But the bigger point is one that Doug convinced me of only after I
> went on a mad micro-optimization tear earlier in the project (Sorry,
> Doug, you were right) -- and that is that for the most part,
> tokenization is a very, very small part of the total work done by the
> system.  Tokenization gets done once for each document, whereas the
> document gets merged, searched, and queried many times.  Time spent
> tweaking tokenizers for performance is likely wasted effort; that
> time could probably be much better spent improving the code in much
> more useful ways.
>
> Sure, StandardTokenizer is slow.  But that tokenization effort gets
> spread over the many times the document is searched.  Even if it does
> a 1% better job at tokenizing, that might be worth a 100x increase in
> tokenizing time.
> I think any effort you want to spend tweaking tokenizers would be
> much better spent doing a better job of tokenization and
> preprocessing (stemming, dealing intelligently with non-letters and
> word breaks, format stripping) than on performance tweaks.
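
[A minimal sketch of the stop-word change discussed above, assuming
pre-generics Java of the era.  The class and method names here are
hypothetical and illustrative only, not the actual Lucene StopFilter
source.]

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;

    // Idea from the thread: replace the synchronized Hashtable of stop
    // words with an unsynchronized collection that is built once in the
    // constructor and never modified afterwards, so thread safety is not
    // a concern (Brian's point about the map being "static for the
    // duration of its lifetime").
    public class StopWordSketch {

        private final Set stopWords;

        public StopWordSketch(String[] words) {
            Set set = new HashSet();
            for (int i = 0; i < words.length; i++) {
                set.add(words[i]);
            }
            // Wrapping in an unmodifiable view makes the "built in the
            // constructor, never changed" property explicit.
            stopWords = Collections.unmodifiableSet(set);
        }

        // Where Hashtable-based code would test table.get(word) != null,
        // a Set uses contains() instead -- the set.get(object) call Otis
        // mentions does not exist on java.util.Set.
        public boolean isStopWord(String word) {
            return stopWords.contains(word);
        }
    }

[For a pure membership test like stop-word filtering, contains() on a
HashSet is enough, which is why the missing get() need not force the
switch back to HashMap.]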