lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Doron Cohen (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-388) [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources
Date Wed, 16 Aug 2006 22:07:16 GMT
     [ http://issues.apache.org/jira/browse/LUCENE-388?page=all ]

Doron Cohen updated LUCENE-388:
-------------------------------

    Attachment: doron_IndexWriter.patch

It seems that the excessive cpu usage is mainly for (re)scanning those single-doc segments
at the top of the "stack". 

The following simple patch (small modification to previous one) only keeps track of the number
of single doc segments, and when iterating for merge candidate segments, takes all those single
doc segments in one step. 

All tests pass with this patch, running a bit faster (3 seconds - can be just noise). A "tight
mem setting" (2048 cacm files with -Xmx3m and setMaxBufferedDocs(50) ) that fails with previous
patch passes with this one. I checked the trace for which segments were merged - same merge
decisions as before this change for these 2048 filess.

> [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources
> --------------------------------------------------------------------
>
>                 Key: LUCENE-388
>                 URL: http://issues.apache.org/jira/browse/LUCENE-388
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: CVS Nightly - Specify date in submission
>         Environment: Operating System: Mac OS X 10.3
> Platform: Macintosh
>            Reporter: Paul Smith
>         Assigned To: Yonik Seeley
>         Attachments: doron_IndexWriter.patch, IndexWriter.patch, log-compound.txt, log.optimized.deep.txt,
log.optimized.txt, Lucene Performance Test - with & without hack.xls, lucene.34930.patch,
yonik_indexwriter.diff, yonik_indexwriter.diff
>
>
> Note: I believe this to be the same situation with 1.4.3 as with SVN HEAD.
> Analysis using hprof utility shows that during index creation with many
> documents highlights that the CPU spends a large portion of it's time in
> IndexWriter.maybeMergeSegments(), which seems to be a 'waste' compared with
> other valuable CPU intensive operations such as tokenization etc.
> Using the following test snippet to retrieve some rows from the db and create an
> index:
>         Analyzer a = new StandardAnalyzer();
>         writer = new IndexWriter(indexDir, a, true);
>         writer.setMergeFactor(1000);
>         writer.setMaxBufferedDocs(10000);
>         writer.setUseCompoundFile(false);
>         connection = DriverManager.getConnection(
>                 "jdbc:inetdae7:tower.aconex.com?database=<somedb>", "secret",
>                 "squirrel");
>         String sql = "select userid, userfirstname, userlastname, email from userx";
>         LOG.info("sql=" + sql);
>         Statement statement = connection.createStatement();
>         statement.setFetchSize(5000);
>         LOG.info("Executing sql");
>         ResultSet rs = statement.executeQuery(sql);
>         LOG.info("ResultSet retrieved");
>         int row = 0;
>         LOG.info("Indexing users");
>         long begin = System.currentTimeMillis();
>         while (rs.next()) {
>             int userid = rs.getInt(1);
>             String firstname = rs.getString(2);
>             String lastname = rs.getString(3);
>             String email = rs.getString(4);
>             String fullName = firstname + " " + lastname;
>             Document doc = new Document();
>             doc.add(Field.Keyword("userid", userid+""));
>             doc.add(Field.Keyword("firstname", firstname.toLowerCase()));
>             doc.add(Field.Keyword("lastname", lastname.toLowerCase()));
>             doc.add(Field.Text("name", fullName.toLowerCase()));
>             doc.add(Field.Keyword("email", email.toLowerCase()));
>             writer.addDocument(doc);
>             row++;
>             if((row % 100)==0){
>                 LOG.info(row + " indexed");
>             }
>         }
>         double end = System.currentTimeMillis();
>         double diff = (end-begin)/1000;
>         double rate = row/diff;
>         LOG.info("rate:" +rate);
> On my 1.5GHz PowerBook with 1.5Gb RAM and a 5400 RPM drive, my CPU is maxed out,
> and I end up getting a rate of indexing between 490-515 documents/second run
> over 10 times in succession.  
> By applying a simple patch to IndexWriter (see attached shortly), which defers
> the calling of maybeMergeSegments() so that it is only called every 2000
> times(an arbitrary figure), I appear to get a new rate of between 945-970
> documents/second.  Using Luke to look inside each index created between these 2
> there does not appear to be any difference.  Same number of Documents, same
> number of Terms.
> I'm not suggesting one should apply this patch, I'm just highlighting the
> difference in performance that this sort of change gives you.  
> We are about to use Lucene to index 4 million construction document records, and
> so speeding up the indexing process is in our best interest! :)  If one
> considers the amount of CPU time spent in maybeMergeSegments over the initial
> index creation of 4 million documents, I think one could see how it would be
> ideal to try to speed this area up (at least move the bottleneck to IO). 
> I woul appreciate anyone taking a moment to comment on this.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message