lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Scott Smith <SSm...@MainstreamData.com>
Subject RE: Performance question
Date Thu, 08 Jan 2004 02:24:22 GMT
After two rather frustrating days, I find I need to apologize to Lucene.  My
last run of 225 messages averaged around 25 milliseconds per message--that's
parsing the xml, creating the Document, and putting it in the index (2.5Ghz
cpu, 1G ram).  Turns out the performance problem was xerces sax "helping me"
by loading the DTD before it parsed each message and the DTD wasn't local to
our site.  After seeing Terry's response, I knew there had to be more going
on than what I was assuming.

Thanks for the suggestions.  I wonder how much faster I can go if I
implement some of those?

Regards

Scott  

-----Original Message-----
From: Terry Steichen [mailto:terry@net-frame.com] 
Sent: Tuesday, January 06, 2004 5:48 AM
To: Lucene Users List
Subject: Re: Performance question


Scott,

Here are some figures to use for comparision.  Using the latest Lucene
release, I index about 200 similar-sized XML files at a time, on a Windows
XP machine (2Ghz).  First I create a new index, which adds the documents at
a rate of about 8 per second (I don't recall what the cpu % is during this).
Then I merge this new index with the master one (using, I think, the default
merge factor), which takes about 4.5 minutes (during which time the cpu
utilization stays near 100%).  The master index currently holds about
115,000 such documents.

HTH,

Regards,

Terry

----- Original Message -----
From: "Scott Smith" <SSmith@MainstreamData.com>
To: <lucene-user@jakarta.apache.org>
Sent: Monday, January 05, 2004 10:26 PM
Subject: Performance question


> I have an application that is reading in XML files and indexing them.
Each
> XML file is 3K-6K bytes.  This application preloads a database that I 
> will add to "on the fly" later.  However, all I want it to do 
> initially is take some existing files and create the initial index as 
> quick as I can.
>
> Since I want to index "on the fly" later, I set the merge factor to 
> 10.
I'm
> assuming that I can't create the index initially with one merge factor 
> (e.g., 100) and then change the merge factor later (true?).
>
> What I see is that it takes 1-3 seconds per xml file to do the index.
This
> means I'm indexing around 150k bytes per minute.  I also notice that 
> the
CPU
> utilization rarely exceeds 5% (looking at task manager on a Windows 
> box).
I
> use Xerces to read in the files (SAX interface) and I don't close or 
> optimize the index between stories nor do I sleep anyplace.  I've 
> looked
at
> the page fault numbers and they aren't changing much.  I guess I would
have
> expected that I would have pretty much pegged the CPU and seen much 
> faster indexing.
>
> Any ideas/suggestions?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message