lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-388) [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources
Date Wed, 16 Aug 2006 22:33:16 GMT
    [ http://issues.apache.org/jira/browse/LUCENE-388?page=comments#action_12428527 ] 
            
Yonik Seeley commented on LUCENE-388:
-------------------------------------

I was literally a minute away from committing my version when Doron submitted his ;-)
Actually, I think I like Doron's "singleDocSegmentsCount" better... it's easier to understand
at a glance.

I was testing the performance of mine... not as much of a speedup as I would have liked:
5 to 6% better with maxBufferedDocs=1000 and a trivial single-field document.
You need to go to maxBufferedDocs=10000 to see a good speedup, and that's probably not advisable
for most real indices (and the maxBufferedDocs=1000 setting used much less memory and was slightly
faster anyway).

Here is the code I added to IndexWriter to test my version (add testInvariants() after add()
call and after flushRamSegments() in close(), then do "ant test")

  private synchronized void testInvariants() {
    // index segments should decrease in size
    int maxSegLevel = 0;
    for (int i=segmentInfos.size()-1; i>=0; i--) {
      SegmentInfo si = segmentInfos.info(i);
      int segLevel = (si.docCount)/minMergeDocs;
      if (segLevel < maxSegLevel) {
        throw new RuntimeException("Segment #" + i + " is too small. " + segInfo());
      }
      maxSegLevel = Math.max(maxSegLevel,segLevel);
    }

    // check if merges needed
    long targetMergeDocs = minMergeDocs;
    int minSegment = segmentInfos.size();

    while (targetMergeDocs <= maxMergeDocs && minSegment>=0) {
      int mergeDocs = 0;
      while (--minSegment >= 0) {
        SegmentInfo si = segmentInfos.info(minSegment);
        if (si.docCount >= targetMergeDocs) break;
        mergeDocs += si.docCount;
      }

      if (mergeDocs >= targetMergeDocs) {
        throw new RuntimeException("Merge needed at level "+targetMergeDocs + " :"+segInfo());
      }

      targetMergeDocs *= mergeFactor;  // increase target size
    }
  }

  private String segInfo() {
    StringBuffer sb = new StringBuffer("minMergeDocs=" + minMergeDocs
        + " docsLeftBeforeMerge=" + docsLeftBeforeMerge + " segsizes:");
    for (int i=0; i<segmentInfos.size(); i++) {
      sb.append(segmentInfos.info(i).docCount);
      sb.append(",");
    }
    return sb.toString();
  }
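
For anyone who wants to play with the check outside of IndexWriter, here is a minimal
standalone sketch of the same two invariants. It is not part of either patch: the class
name SegmentInvariants and the check() signature are made up for illustration, and it takes
a plain int[] of per-segment doc counts (oldest first) instead of segmentInfos. It returns
a message instead of throwing so that it is easy to poke at.

```java
import java.util.Arrays;

// Hypothetical standalone version of testInvariants(); names are illustrative only.
public class SegmentInvariants {

  /**
   * docCounts holds per-segment doc counts, oldest segment first.
   * Returns null if both invariants hold, otherwise a description of the violation.
   */
  public static String check(int[] docCounts, int minMergeDocs,
                             int mergeFactor, int maxMergeDocs) {
    // Invariant 1: walking from the newest segment to the oldest,
    // the segment level (docCount / minMergeDocs) must never drop.
    int maxSegLevel = 0;
    for (int i = docCounts.length - 1; i >= 0; i--) {
      int segLevel = docCounts[i] / minMergeDocs;
      if (segLevel < maxSegLevel) {
        return "Segment #" + i + " is too small. segsizes:" + Arrays.toString(docCounts);
      }
      maxSegLevel = Math.max(maxSegLevel, segLevel);
    }

    // Invariant 2: at each target size, the run of smaller segments
    // must not already hold enough docs to require a merge.
    long targetMergeDocs = minMergeDocs;
    int minSegment = docCounts.length;
    while (targetMergeDocs <= maxMergeDocs && minSegment >= 0) {
      int mergeDocs = 0;
      while (--minSegment >= 0) {
        if (docCounts[minSegment] >= targetMergeDocs) break;
        mergeDocs += docCounts[minSegment];
      }
      if (mergeDocs >= targetMergeDocs) {
        return "Merge needed at level " + targetMergeDocs
            + " segsizes:" + Arrays.toString(docCounts);
      }
      targetMergeDocs *= mergeFactor;  // increase target size
    }
    return null;
  }
}
```

With minMergeDocs=10 and mergeFactor=10, segment sizes {100, 10, 10, 3} pass, {3, 100} fail
the first check (segment #0 is too small), and {5, 5} fail the second (a merge is needed at
level 10).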


> [PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources
> --------------------------------------------------------------------
>
>                 Key: LUCENE-388
>                 URL: http://issues.apache.org/jira/browse/LUCENE-388
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: CVS Nightly - Specify date in submission
>         Environment: Operating System: Mac OS X 10.3
> Platform: Macintosh
>            Reporter: Paul Smith
>         Assigned To: Yonik Seeley
>         Attachments: doron_IndexWriter.patch, IndexWriter.patch, log-compound.txt, log.optimized.deep.txt,
>                      log.optimized.txt, Lucene Performance Test - with & without hack.xls, lucene.34930.patch,
>                      yonik_indexwriter.diff, yonik_indexwriter.diff
>
>
> Note: I believe this to be the same situation with 1.4.3 as with SVN HEAD.
> Analysis using the hprof utility shows that, during index creation with many
> documents, the CPU spends a large portion of its time in
> IndexWriter.maybeMergeSegments(), which seems to be a 'waste' compared with
> other valuable CPU-intensive operations such as tokenization.
> I used the following test snippet to retrieve some rows from the db and create an
> index:
>         Analyzer a = new StandardAnalyzer();
>         writer = new IndexWriter(indexDir, a, true);
>         writer.setMergeFactor(1000);
>         writer.setMaxBufferedDocs(10000);
>         writer.setUseCompoundFile(false);
>         connection = DriverManager.getConnection(
>                 "jdbc:inetdae7:tower.aconex.com?database=<somedb>", "secret",
>                 "squirrel");
>         String sql = "select userid, userfirstname, userlastname, email from userx";
>         LOG.info("sql=" + sql);
>         Statement statement = connection.createStatement();
>         statement.setFetchSize(5000);
>         LOG.info("Executing sql");
>         ResultSet rs = statement.executeQuery(sql);
>         LOG.info("ResultSet retrieved");
>         int row = 0;
>         LOG.info("Indexing users");
>         long begin = System.currentTimeMillis();
>         while (rs.next()) {
>             int userid = rs.getInt(1);
>             String firstname = rs.getString(2);
>             String lastname = rs.getString(3);
>             String email = rs.getString(4);
>             String fullName = firstname + " " + lastname;
>             Document doc = new Document();
>             doc.add(Field.Keyword("userid", userid+""));
>             doc.add(Field.Keyword("firstname", firstname.toLowerCase()));
>             doc.add(Field.Keyword("lastname", lastname.toLowerCase()));
>             doc.add(Field.Text("name", fullName.toLowerCase()));
>             doc.add(Field.Keyword("email", email.toLowerCase()));
>             writer.addDocument(doc);
>             row++;
>             if((row % 100)==0){
>                 LOG.info(row + " indexed");
>             }
>         }
>         double end = System.currentTimeMillis();
>         double diff = (end-begin)/1000;
>         double rate = row/diff;
>         LOG.info("rate:" +rate);
> On my 1.5GHz PowerBook with 1.5GB RAM and a 5400 RPM drive, my CPU is maxed out,
> and I end up getting an indexing rate of 490-515 documents/second over 10
> successive runs.
> By applying a simple patch to IndexWriter (see attached shortly), which defers
> the calling of maybeMergeSegments() so that it is only called every 2000
> times (an arbitrary figure), I appear to get a new rate of 945-970
> documents/second.  Using Luke to look inside the indexes created by the two
> versions, there does not appear to be any difference.  Same number of Documents, same
> number of Terms.
> I'm not suggesting one should apply this patch, I'm just highlighting the
> difference in performance that this sort of change gives you.  
> We are about to use Lucene to index 4 million construction document records, and
> so speeding up the indexing process is in our best interest! :)  If one
> considers the amount of CPU time spent in maybeMergeSegments over the initial
> index creation of 4 million documents, I think one could see how it would be
> ideal to try to speed this area up (at least move the bottleneck to IO). 
> I would appreciate anyone taking a moment to comment on this.
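
To make the deferral described in the report concrete, here is a rough sketch of the idea
only. It is not the attached patch: the class DeferredMergeCheck, its method names, and the
Runnable callback are hypothetical, and the interval of 2000 is the arbitrary figure from the
report. The merge check runs once every N added documents, plus once more on flush so that a
trailing partial batch is not missed.

```java
// Hypothetical sketch of "only call the merge check every N adds"; not the real patch.
public class DeferredMergeCheck {
  private final int interval;         // e.g. 2000, the arbitrary figure from the report
  private final Runnable mergeCheck;  // stands in for maybeMergeSegments()
  private int pending = 0;            // docs added since the last check

  public DeferredMergeCheck(int interval, Runnable mergeCheck) {
    if (interval < 1) throw new IllegalArgumentException("interval must be >= 1");
    this.interval = interval;
    this.mergeCheck = mergeCheck;
  }

  /** Call once per added document; runs the check on every interval-th call. */
  public synchronized void onDocumentAdded() {
    if (++pending >= interval) {
      pending = 0;
      mergeCheck.run();
    }
  }

  /** Call before closing so docs added since the last check are still considered. */
  public synchronized void flush() {
    if (pending > 0) {
      pending = 0;
      mergeCheck.run();
    }
  }
}
```

Adding 4500 documents with an interval of 2000 runs the check twice, and a final flush()
runs it a third time for the remaining 500.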

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

