lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Doron Cohen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-388) [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources
Date Sun, 20 Aug 2006 07:03:16 GMT
    [ http://issues.apache.org/jira/browse/LUCENE-388?page=comments#action_12429248 ] 
            
Doron Cohen commented on LUCENE-388:
------------------------------------

Paul, would you like to re-open this issue for (re)solving it with one of the two recent patches
(2 or 2b)  - I think that once an issue is resolved it can be re-opened (or closed) by the
developer who opened it. (I assume commiters can also re-open, but I'm not sure - well, at
least I can't.)
Thanks, Doron

> [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources
> --------------------------------------------------------------------
>
>                 Key: LUCENE-388
>                 URL: http://issues.apache.org/jira/browse/LUCENE-388
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: CVS Nightly - Specify date in submission
>         Environment: Operating System: Mac OS X 10.3
> Platform: Macintosh
>            Reporter: Paul Smith
>         Assigned To: Yonik Seeley
>             Fix For: 2.0.1
>
>         Attachments: doron_2_IndexWriter.patch, doron_2b_IndexWriter.patch, doron_IndexWriter.patch,
IndexWriter.patch, log-compound.txt, log.optimized.deep.txt, log.optimized.txt, Lucene Performance
Test - with & without hack.xls, lucene.34930.patch, yonik_indexwriter.diff, yonik_indexwriter.diff
>
>
> Note: I believe this to be the same situation with 1.4.3 as with SVN HEAD.
> Analysis using hprof utility shows that during index creation with many
> documents highlights that the CPU spends a large portion of it's time in
> IndexWriter.maybeMergeSegments(), which seems to be a 'waste' compared with
> other valuable CPU intensive operations such as tokenization etc.
> Using the following test snippet to retrieve some rows from the db and create an
> index:
>         Analyzer a = new StandardAnalyzer();
>         writer = new IndexWriter(indexDir, a, true);
>         writer.setMergeFactor(1000);
>         writer.setMaxBufferedDocs(10000);
>         writer.setUseCompoundFile(false);
>         connection = DriverManager.getConnection(
>                 "jdbc:inetdae7:tower.aconex.com?database=<somedb>", "secret",
>                 "squirrel");
>         String sql = "select userid, userfirstname, userlastname, email from userx";
>         LOG.info("sql=" + sql);
>         Statement statement = connection.createStatement();
>         statement.setFetchSize(5000);
>         LOG.info("Executing sql");
>         ResultSet rs = statement.executeQuery(sql);
>         LOG.info("ResultSet retrieved");
>         int row = 0;
>         LOG.info("Indexing users");
>         long begin = System.currentTimeMillis();
>         while (rs.next()) {
>             int userid = rs.getInt(1);
>             String firstname = rs.getString(2);
>             String lastname = rs.getString(3);
>             String email = rs.getString(4);
>             String fullName = firstname + " " + lastname;
>             Document doc = new Document();
>             doc.add(Field.Keyword("userid", userid+""));
>             doc.add(Field.Keyword("firstname", firstname.toLowerCase()));
>             doc.add(Field.Keyword("lastname", lastname.toLowerCase()));
>             doc.add(Field.Text("name", fullName.toLowerCase()));
>             doc.add(Field.Keyword("email", email.toLowerCase()));
>             writer.addDocument(doc);
>             row++;
>             if((row % 100)==0){
>                 LOG.info(row + " indexed");
>             }
>         }
>         double end = System.currentTimeMillis();
>         double diff = (end-begin)/1000;
>         double rate = row/diff;
>         LOG.info("rate:" +rate);
> On my 1.5GHz PowerBook with 1.5Gb RAM and a 5400 RPM drive, my CPU is maxed out,
> and I end up getting a rate of indexing between 490-515 documents/second run
> over 10 times in succession.  
> By applying a simple patch to IndexWriter (see attached shortly), which defers
> the calling of maybeMergeSegments() so that it is only called every 2000
> times(an arbitrary figure), I appear to get a new rate of between 945-970
> documents/second.  Using Luke to look inside each index created between these 2
> there does not appear to be any difference.  Same number of Documents, same
> number of Terms.
> I'm not suggesting one should apply this patch, I'm just highlighting the
> difference in performance that this sort of change gives you.  
> We are about to use Lucene to index 4 million construction document records, and
> so speeding up the indexing process is in our best interest! :)  If one
> considers the amount of CPU time spent in maybeMergeSegments over the initial
> index creation of 4 million documents, I think one could see how it would be
> ideal to try to speed this area up (at least move the bottleneck to IO). 
> I woul appreciate anyone taking a moment to comment on this.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message