Message-ID: <33462118.1155908122633.JavaMail.jira@brutus>
Date: Fri, 18 Aug 2006 06:35:22 -0700 (PDT)
From: "Yonik Seeley (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-388) [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources

    [ http://issues.apache.org/jira/browse/LUCENE-388?page=comments#action_12429012 ]

Yonik Seeley commented on LUCENE-388:
-------------------------------------

Thanks Doron, I caught that too and I was just going to set the count to 0
in mergeSegments (I think mergeSegments is currently always called with
end == size()). Your fix is better though: it gives more flexibility.

> [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources
> --------------------------------------------------------------------
>
>                 Key: LUCENE-388
>                 URL: http://issues.apache.org/jira/browse/LUCENE-388
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: CVS Nightly - Specify date in submission
>         Environment: Operating System: Mac OS X 10.3
> Platform: Macintosh
>            Reporter: Paul Smith
>         Assigned To: Yonik Seeley
>             Fix For: 2.0.1
>
>         Attachments: doron_2_IndexWriter.patch, doron_IndexWriter.patch, IndexWriter.patch, log-compound.txt, log.optimized.deep.txt, log.optimized.txt, Lucene Performance Test - with & without hack.xls, lucene.34930.patch, yonik_indexwriter.diff, yonik_indexwriter.diff
>
>
> Note: I believe this to be the same situation with 1.4.3 as with SVN HEAD.
> Analysis using the hprof utility shows that, during index creation with many
> documents, the CPU spends a large portion of its time in
> IndexWriter.maybeMergeSegments(), which seems to be a 'waste' compared with
> other valuable CPU-intensive operations such as tokenization etc.
> Using the following test snippet to retrieve some rows from the db and create an
> index:
>     Analyzer a = new StandardAnalyzer();
>     writer = new IndexWriter(indexDir, a, true);
>     writer.setMergeFactor(1000);
>     writer.setMaxBufferedDocs(10000);
>     writer.setUseCompoundFile(false);
>     connection = DriverManager.getConnection(
>             "jdbc:inetdae7:tower.aconex.com?database=", "secret",
>             "squirrel");
>     String sql = "select userid, userfirstname, userlastname, email from userx";
>     LOG.info("sql=" + sql);
>     Statement statement = connection.createStatement();
>     statement.setFetchSize(5000);
>     LOG.info("Executing sql");
>     ResultSet rs = statement.executeQuery(sql);
>     LOG.info("ResultSet retrieved");
>     int row = 0;
>     LOG.info("Indexing users");
>     long begin = System.currentTimeMillis();
>     while (rs.next()) {
>         int userid = rs.getInt(1);
>         String firstname = rs.getString(2);
>         String lastname = rs.getString(3);
>         String email = rs.getString(4);
>         String fullName = firstname + " " + lastname;
>         Document doc = new Document();
>         doc.add(Field.Keyword("userid", userid+""));
>         doc.add(Field.Keyword("firstname", firstname.toLowerCase()));
>         doc.add(Field.Keyword("lastname", lastname.toLowerCase()));
>         doc.add(Field.Text("name", fullName.toLowerCase()));
>         doc.add(Field.Keyword("email", email.toLowerCase()));
>         writer.addDocument(doc);
>         row++;
>         if((row % 100)==0){
>             LOG.info(row + " indexed");
>         }
>     }
>     double end = System.currentTimeMillis();
>     double diff = (end-begin)/1000;
>     double rate = row/diff;
>     LOG.info("rate:" +rate);
> On my 1.5 GHz PowerBook with 1.5 GB of RAM and a 5400 RPM drive, my CPU is maxed
> out, and I get an indexing rate of 490-515 documents/second over 10 successive
> runs.
> By applying a simple patch to IndexWriter (see attached shortly), which defers
> the calling of maybeMergeSegments() so that it is only called every 2000
> times (an arbitrary figure), I appear to get a new rate of between 945-970
> documents/second.
> Using Luke to look inside each index created by these two runs,
> there does not appear to be any difference. Same number of Documents, same
> number of Terms.
> I'm not suggesting one should apply this patch, I'm just highlighting the
> difference in performance that this sort of change gives you.
> We are about to use Lucene to index 4 million construction document records, and
> so speeding up the indexing process is in our best interest! :) If one
> considers the amount of CPU time spent in maybeMergeSegments over the initial
> index creation of 4 million documents, I think one could see how it would be
> ideal to try to speed this area up (at least move the bottleneck to IO).
> I would appreciate anyone taking a moment to comment on this.
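
Editor's illustration: the deferral idea in the quoted report (running the merge check only once every 2000 added documents) might look roughly like the sketch below. This is not the attached patch and not Lucene code. The class DeferredMergeCheck, its methods, and the Runnable standing in for IndexWriter.maybeMergeSegments() are all hypothetical names chosen for illustration.

```java
// Hypothetical helper, NOT part of Lucene and NOT the attached patch.
// It runs an expensive check once every `interval` calls instead of
// on every added document.
final class DeferredMergeCheck {
    private final int interval;   // e.g. 2000, the arbitrary figure from the report
    private final Runnable check; // stand-in for IndexWriter.maybeMergeSegments()
    private int sinceLastCheck = 0;
    private int checksRun = 0;

    DeferredMergeCheck(int interval, Runnable check) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.interval = interval;
        this.check = check;
    }

    // Call once per added document; the check runs only on every interval-th call.
    void onDocumentAdded() {
        sinceLastCheck++;
        if (sinceLastCheck >= interval) {
            runCheck();
        }
    }

    // Run a check that is still pending, for example before closing the writer.
    void flush() {
        if (sinceLastCheck > 0) {
            runCheck();
        }
    }

    int checksRun() {
        return checksRun;
    }

    private void runCheck() {
        sinceLastCheck = 0;
        checksRun++;
        check.run();
    }

    public static void main(String[] args) {
        DeferredMergeCheck deferred = new DeferredMergeCheck(2000, () -> { });
        for (int i = 0; i < 10000; i++) {
            deferred.onDocumentAdded();
        }
        // 10000 documents at an interval of 2000 trigger 5 checks.
        System.out.println("checks run: " + deferred.checksRun());
    }
}
```

One consequence of deferring: documents added since the last check have not been considered for merging, so a final flush() (or an equivalent step on close) would be needed to avoid leaving that work undone.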