lucene-dev mailing list archives

From Paul Smith <psm...@aconex.com>
Subject [Performance]: IndexWriter again...
Date Mon, 16 May 2005 06:15:08 GMT
Ok, I'm just following up on my email from 29th April titled  
'[Performanc]'  (don't you love it when you send before you've typed  
your subject line completely).  The thread is here:

http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200504.mbox/%3C427198C5.5040408@aconex.com%3E

In summary, I still firmly believe that the  
IndexWriter.maybeMergeSegments() is chewing a lot more CPU than would  
be ideal.  So I ran a simple test.  I ran the same test I've done  
before, using mergeFactor(1000), maxBufferedDocs(10000),  
useCompoundFile(false), indexing 5 fields (user first/lastname/email address).

As a baseline using the latest SVN source code, I'm getting an  
indexing rate of between 490-515 items/second over a number of runs.

By applying the attached simple patch to IndexWriter, I'm getting  
between 945-970 items/second over a number of test runs.  That's a significant speed  
up.  All the patch is doing is deferring the call to  
maybeMergeSegments so it only does it every 2000 iterations (2000 is  
totally arbitrary on my part).
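
For anyone reading this without the attachment, the idea can be pictured roughly as below. This is only a minimal sketch of the deferral technique, not the patch itself and not Lucene code: the class DeferredMergeSketch, the mergeCheckInterval setting and the close() catch-up call are all hypothetical names I've made up for illustration, and maybeMergeSegments() here is an empty stand-in that just counts how often it gets called.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy stand-in for an index writer. NOT Lucene's IndexWriter: every name
 * here is hypothetical and exists only to illustrate deferring a check.
 */
public class DeferredMergeSketch {
    // Hypothetical setting: how many added documents between merge checks.
    private final int mergeCheckInterval;
    private final List<String> buffered = new ArrayList<>();
    private int docsSinceLastCheck = 0;
    private int mergeChecks = 0;

    public DeferredMergeSketch(int mergeCheckInterval) {
        if (mergeCheckInterval < 1) {
            throw new IllegalArgumentException("mergeCheckInterval must be >= 1");
        }
        this.mergeCheckInterval = mergeCheckInterval;
    }

    /** Buffers the document and only runs the merge check every N additions. */
    public void addDocument(String doc) {
        buffered.add(doc);
        docsSinceLastCheck++;
        if (docsSinceLastCheck >= mergeCheckInterval) {
            maybeMergeSegments();
            docsSinceLastCheck = 0;
        }
    }

    /** Runs one final check so documents added since the last check are not skipped. */
    public void close() {
        if (docsSinceLastCheck > 0) {
            maybeMergeSegments();
            docsSinceLastCheck = 0;
        }
    }

    // Stand-in for the real (expensive) check; here it only counts invocations.
    private void maybeMergeSegments() {
        mergeChecks++;
    }

    public int getMergeChecks() {
        return mergeChecks;
    }

    public int getDocCount() {
        return buffered.size();
    }

    public static void main(String[] args) {
        // Interval 1 mimics checking on every add; 2000 mimics the deferred variant.
        for (int interval : new int[] {1, 2000}) {
            DeferredMergeSketch writer = new DeferredMergeSketch(interval);
            for (int i = 0; i < 10000; i++) {
                writer.addDocument("doc" + i);
            }
            writer.close();
            System.out.println("interval=" + interval
                    + " docs=" + writer.getDocCount()
                    + " mergeChecks=" + writer.getMergeChecks());
        }
    }
}
```

The sketch only shows that the number of checks drops while the number of documents stays the same. It says nothing about whether deferring is safe for the real merge logic, which is exactly the open question.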

I've verified with Luke that the index generated contains the same #  
documents, and same # terms, but I have not had a chance to properly  
setup my local environment to run the test cases.

Obviously the attached patch is a dirty hack of the highest order. In  
my case I'm re-indexing from scratch every time, so there may be a  
reason why we shouldn't be doing this sort of deferring of method  
calls.  Perhaps the source code is optimized around incremental/batch  
updates to _existing_ indexes, with the penalty that creating a new  
index from scratch performs slower than one would like.

Perhaps IndexWriter could benefit from another setting that lets one  
configure how often to call maybeMergeSegments()?  That could of  
course confuse more people than it helps.

I would really appreciate anyone's thoughts on this.  I'll be very  
happy to be proven wrong because it will just help me understand more  
of Lucene.  I would hope that speeding up indexing would benefit  
everyone?  Particularly the large scale sites out there.

cheers,

Paul Smith



