lucene-java-user mailing list archives

From Dror Matalon <d...@zapatec.com>
Subject Re: Performance question
Date Thu, 08 Jan 2004 03:48:56 GMT
On Wed, Jan 07, 2004 at 07:24:22PM -0700, Scott Smith wrote:
> After two rather frustrating days, I find I need to apologize to Lucene.  My
> last run of 225 messages averaged around 25 milliseconds per message--that's
> parsing the xml, creating the Document, and putting it in the index (2.5Ghz
> cpu, 1G ram).  Turns out the performance problem was xerces sax "helping me"
> by loading the DTD before it parsed each message and the DTD wasn't local to
> our site.  After seeing Terry's response, I knew there had to be more going
> on than what I was assuming.
> 
> Thanks for the suggestions.  I wonder how much faster I can go if I
> implement some of those?
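
The remote DTD fetch described above can be short-circuited with a SAX
EntityResolver, so the parser never goes to the network. What follows is
a minimal sketch, not Scott's actual fix: it assumes the messages do not
depend on entities or default attributes declared in the DTD, and the
class, method, and counter names are made up for illustration. If the
DTD content is needed, the resolver could return a locally stored copy
instead of an empty stream.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: parses a message without fetching its external DTD.
class LocalDtdParsing {
    // Counts how often the parser asked for an external entity (e.g. the DTD).
    static final AtomicInteger dtdRequests = new AtomicInteger();

    // Parses the XML and returns the number of elements seen.
    static int countElements(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            // Assumption: the DTD content is not needed, so every external
            // entity request is answered with an empty in-memory stream.
            reader.setEntityResolver((publicId, systemId) -> {
                dtdRequests.incrementAndGet();
                return new InputSource(new StringReader(""));
            });

            final int[] elements = {0};
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    elements[0]++;
                }
            });

            reader.parse(new InputSource(new StringReader(xml)));
            return elements[0];
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse message", e);
        }
    }
}
```

Creating the factory once and reusing it across messages would likely
save a little more time per message; it is rebuilt here only to keep the
sketch short.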

25 msecs to insert a document is on the high side, though it depends of
course on the size of your documents. You're probably spending 90% of
your time in the XML parsing. I believe there are parsers that are
faster than xerces; you might want to look at http://dom4j.org/.
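
The 90% figure is a guess, and it is cheap to check before swapping
parsers. Here is a rough sketch of a per-phase timer; the PhaseTimer
class and the phase names are hypothetical, and the parse and index
steps are whatever your own code does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: accumulates wall-clock time per named phase.
class PhaseTimer {
    private final Map<String, Long> nanos = new LinkedHashMap<>();

    // Runs the work and adds its elapsed time to the given phase.
    void time(String phase, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            nanos.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    // Total time recorded across all phases, in nanoseconds.
    long totalNanos() {
        long total = 0;
        for (long n : nanos.values()) {
            total += n;
        }
        return total;
    }

    // Fraction of the total recorded time spent in one phase (0.0 to 1.0).
    double share(String phase) {
        long total = totalNanos();
        return total == 0 ? 0.0 : (double) nanos.getOrDefault(phase, 0L) / total;
    }
}
```

Wrap the parsing call in timer.time("parse", ...) and the
IndexWriter.addDocument call in timer.time("index", ...) for each
message, then compare timer.share("parse") with timer.share("index")
after the run.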

Dror


> 
> Regards
> 
> Scott  
> 
> -----Original Message-----
> From: Terry Steichen [mailto:terry@net-frame.com] 
> Sent: Tuesday, January 06, 2004 5:48 AM
> To: Lucene Users List
> Subject: Re: Performance question
> 
> 
> Scott,
> 
> Here are some figures to use for comparison.  Using the latest Lucene
> release, I index about 200 similar-sized XML files at a time, on a Windows
> XP machine (2Ghz).  First I create a new index, which adds the documents at
> a rate of about 8 per second (I don't recall what the cpu % is during this).
> Then I merge this new index with the master one (using, I think, the default
> merge factor), which takes about 4.5 minutes (during which time the cpu
> utilization stays near 100%).  The master index currently holds about
> 115,000 such documents.
> 
> HTH,
> 
> Regards,
> 
> Terry
> 
> ----- Original Message -----
> From: "Scott Smith" <SSmith@MainstreamData.com>
> To: <lucene-user@jakarta.apache.org>
> Sent: Monday, January 05, 2004 10:26 PM
> Subject: Performance question
> 
> 
> > I have an application that is reading in XML files and indexing them.
> > Each XML file is 3K-6K bytes.  This application preloads a database
> > that I will add to "on the fly" later.  However, all I want it to do
> > initially is take some existing files and create the initial index as
> > quick as I can.
> >
> > Since I want to index "on the fly" later, I set the merge factor to
> > 10.  I'm assuming that I can't create the index initially with one
> > merge factor (e.g., 100) and then change the merge factor later
> > (true?).
> >
> > What I see is that it takes 1-3 seconds per xml file to do the index.
> > This means I'm indexing around 150k bytes per minute.  I also notice
> > that the CPU utilization rarely exceeds 5% (looking at task manager on
> > a Windows box).  I use Xerces to read in the files (SAX interface) and
> > I don't close or optimize the index between stories nor do I sleep
> > anyplace.  I've looked at the page fault numbers and they aren't
> > changing much.  I guess I would have expected that I would have pretty
> > much pegged the CPU and seen much faster indexing.
> >
> > Any ideas/suggestions?
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> >
> 
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com
