From: Scott Smith
To: 'Lucene Users List'
Subject: RE: Performance question
Date: Thu, 8 Jan 2004 19:00:24 -0700
Message-ID: <039AE64F5C9D7C44A0FE1AD71DD835CF78099C@sideshow.mainstreamdata.com>

The parsing I do currently is pretty straightforward. There are only four tags I look for (and one of those tags typically encompasses most of the file). SAX works great, though I'm not stuck on using Xerces. In the short run, 25 milliseconds is quite acceptable (where, for obvious reasons, the 1.2 seconds was not). In the long run, it sounds like I need to look at some other options besides Xerces.

Another thing I noticed doing this is that the Xerces SAX interface tends to pass small blocks of characters (typically around 50 characters) on each characters callback, even when there are several thousand bytes of character data in the tag. Currently, I add each block of characters to the Document separately. This means I often end up with 100 or more items on the Document linked list for the same field. When I get some time, I would like to see if things work faster if I accumulate these into a StringBuffer and pass them to the Document as one large block instead of a lot of little blocks.

Thanks for all of the suggestions.

Scott

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org]
Sent: Thursday, January 08, 2004 5:24 AM
To: Lucene Users List
Subject: Re: Performance question

Dror Matalon wrote:

>On Wed, Jan 07, 2004 at 07:24:22PM -0700, Scott Smith wrote:
>
>>After two rather frustrating days, I find I need to apologize to
>>Lucene. My last run of 225 messages averaged around 25 milliseconds
>>per message--that's parsing the xml, creating the Document, and
>>putting it in the index (2.5Ghz cpu, 1G ram). Turns out the
>>performance problem was xerces sax "helping me" by loading the DTD
>>before it parsed each message and the DTD wasn't local to our site.
>>After seeing Terry's response, I knew there had to be more going on
>>than what I was assuming.
>>
>>Thanks for the suggestions. I wonder how much faster I can go if I
>>implement some of those?
>
>25 msecs to insert a document is on the high side, but it depends of
>course on the size of your document. You're probably spending 90% of
>your time in the XML parsing. I believe that there are other parsers
>that are faster than xerces, you might want to look at these. You might
>want to look at http://dom4j.org/.
>
>Dror

You may want to check the XML Pull Parser - it offers something between SAX and DOM, with performance similar to SAX.
(http://www.extreme.indiana.edu/xgws/xsoap/xpp)

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
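
Two points from this thread can be sketched together in Java: collecting the small blocks delivered to the SAX characters callback into one buffer per element, and keeping the parser from fetching a remote DTD on every message. This is only a rough sketch under stated assumptions. The class name AccumulatingHandler, the fields map (standing in for the Lucene Document) and the flat, leaf-only element structure are hypothetical and do not come from the original messages. StringBuilder is used where the thread mentions StringBuffer. Whether this actually speeds up indexing was not measured in the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: gathers the full text of each leaf element before handing it on. */
class AccumulatingHandler extends DefaultHandler {
    // One buffer reused for every element (assumes no mixed content).
    private final StringBuilder text = new StringBuilder();
    // Stand-in for the Lucene Document: element name -> complete text.
    private final Map<String, String> fields = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // begin a fresh run of character data
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may call this many times per element; only append here.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // This is the point where one field would be added to the Document.
        String value = text.toString();
        if (value.trim().length() > 0) {
            fields.put(qName, value);
        }
        text.setLength(0);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Hand back an empty source so the parser never goes to the network
        // for the DTD. Entities declared in that DTD would then be undefined.
        return new InputSource(new StringReader(""));
    }

    /** Parses one message and returns the accumulated element texts. */
    static Map<String, String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        AccumulatingHandler handler = new AccumulatingHandler();
        // SAXParser.parse registers the handler as the entity resolver too.
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.fields;
    }
}
```

Calling AccumulatingHandler.parse(xml) on a message returns one entry per leaf element, however many times the parser split its text. If the DTD is actually needed, resolveEntity could return a local copy instead of an empty source.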