lucene-java-user mailing list archives

From "Terry Steichen" <te...@net-frame.com>
Subject Re: Performance question
Date Thu, 08 Jan 2004 04:38:25 GMT
Scott,

FYI - I happen to be using dom4j.  I might add, however, that I parse each
document into about 18 fields.  If, as Dror says, your document is simpler,
(or if you increase the merge factor) you can probably better that.

Regards,

Terry
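
A note on the DTD problem Scott describes in the quoted thread below: one common way to stop a SAX parser from fetching a remote DTD for every message is to install an EntityResolver that answers external entity requests with empty content. The following is only a minimal sketch using the standard JAXP/SAX API; the class name NoRemoteDtd, the CountingHandler, and the example.invalid DTD URL are made up for illustration and are not Scott's code. The handler just counts elements where a real one would build the Lucene Document.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class NoRemoteDtd {

    /** Hypothetical stand-in for the handler that would fill a Lucene Document. */
    static class CountingHandler extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            elements++;
        }
    }

    /** Parses the XML without fetching its external DTD; returns the element count. */
    static int countElements(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // Answer every external entity request (including the DOCTYPE's DTD)
        // with an empty stream, so the parser never goes to the network.
        reader.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));

        CountingHandler handler = new CountingHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.elements;
    }

    public static void main(String[] args) throws Exception {
        // The DTD URL is an assumption; it is deliberately unreachable.
        String xml = "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE story SYSTEM \"http://example.invalid/story.dtd\">"
            + "<story><title>t</title><body>b</body></story>";
        System.out.println(countElements(xml));
    }
}
```

If the documents rely on entities or default attribute values declared in the DTD, an empty DTD will break them; in that case the resolver could return a local copy of the DTD instead. Xerces also documents a feature, http://apache.org/xml/features/nonvalidating/load-external-dtd, that can be set to false for a similar effect on a non-validating parser.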

----- Original Message -----
From: "Dror Matalon" <dror@zapatec.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Wednesday, January 07, 2004 10:48 PM
Subject: Re: Performance question


> On Wed, Jan 07, 2004 at 07:24:22PM -0700, Scott Smith wrote:
> > After two rather frustrating days, I find I need to apologize to Lucene. My
> > last run of 225 messages averaged around 25 milliseconds per message--that's
> > parsing the xml, creating the Document, and putting it in the index (2.5Ghz
> > cpu, 1G ram).  Turns out the performance problem was xerces sax "helping me"
> > by loading the DTD before it parsed each message and the DTD wasn't local to
> > our site.  After seeing Terry's response, I knew there had to be more going
> > on than what I was assuming.
> >
> > Thanks for the suggestions.  I wonder how much faster I can go if I
> > implement some of those?
>
> 25 msecs to insert a document is on the high side, but it depends of
> course on the size of your document. You're probably spending 90% of
> your time in the XML parsing. I believe that there are other parsers
> that are faster than xerces; you might want to look at
> http://dom4j.org/.
>
> Dror
>
>
> >
> > Regards
> >
> > Scott
> >
> > -----Original Message-----
> > From: Terry Steichen [mailto:terry@net-frame.com]
> > Sent: Tuesday, January 06, 2004 5:48 AM
> > To: Lucene Users List
> > Subject: Re: Performance question
> >
> >
> > Scott,
> >
> > Here are some figures to use for comparison.  Using the latest Lucene
> > release, I index about 200 similar-sized XML files at a time, on a Windows
> > XP machine (2Ghz).  First I create a new index, which adds the documents at
> > a rate of about 8 per second (I don't recall what the cpu % is during this).
> > Then I merge this new index with the master one (using, I think, the default
> > merge factor), which takes about 4.5 minutes (during which time the cpu
> > utilization stays near 100%).  The master index currently holds about
> > 115,000 such documents.
> >
> > HTH,
> >
> > Regards,
> >
> > Terry
> >
> > ----- Original Message -----
> > From: "Scott Smith" <SSmith@MainstreamData.com>
> > To: <lucene-user@jakarta.apache.org>
> > Sent: Monday, January 05, 2004 10:26 PM
> > Subject: Performance question
> >
> >
> > > I have an application that is reading in XML files and indexing them. Each
> > > XML file is 3K-6K bytes.  This application preloads a database that I
> > > will add to "on the fly" later.  However, all I want it to do
> > > initially is take some existing files and create the initial index as
> > > quick as I can.
> > >
> > > Since I want to index "on the fly" later, I set the merge factor to 10. I'm
> > > assuming that I can't create the index initially with one merge factor
> > > (e.g., 100) and then change the merge factor later (true?).
> > >
> > > What I see is that it takes 1-3 seconds per xml file to do the index. This
> > > means I'm indexing around 150k bytes per minute.  I also notice that the CPU
> > > utilization rarely exceeds 5% (looking at task manager on a Windows box). I
> > > use Xerces to read in the files (SAX interface) and I don't close or
> > > optimize the index between stories nor do I sleep anyplace.  I've looked at
> > > the page fault numbers and they aren't changing much.  I guess I would have
> > > expected that I would have pretty much pegged the CPU and seen much
> > > faster indexing.
> > >
> > > Any ideas/suggestions?
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > >
> > >
> >
> >
> >
>
> --
> Dror Matalon
> Zapatec Inc
> 1700 MLK Way
> Berkeley, CA 94709
> http://www.fastbuzz.com
> http://www.zapatec.com
>
>
>



