lucene-dev mailing list archives

From Paul Smith <psm...@aconex.com>
Subject [Performance]
Date Fri, 29 Apr 2005 02:15:33 GMT
[All of latest code from SVN]

Ok, I'm just writing this email because I'm confused about some code in 
Lucene, which will almost certainly highlight a failure of mine to fully 
grasp something probably very basic.  I've had the Lucene in Action book 
for a while, and generally been playing around with Lucene on the side, 
but now we have a serious need for Lucene to index a lot of stuff.

I keep hearing people say "can index millions of docs in a handful of 
minutes", but I can't see this in action, even after fiddling with 
mergeFactor/maxMergeDocs etc. (I even tried RAMDirectory batching, as 
sketched below; it may be quicker, but not by much for me).
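
(For the curious, by "RAMDirectory batching" I mean roughly the sketch 
below; BATCH_SIZE, makeDocument() and fsWriter are placeholders for 
illustration, not my real code:)

    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
    int inBatch = 0;
    while (rs.next()) {
        ramWriter.addDocument(makeDocument(rs));  // buffer docs in RAM
        if (++inBatch == BATCH_SIZE) {
            ramWriter.close();
            // merge the whole RAM batch into the on-disk index in one go
            fsWriter.addIndexes(new Directory[] { ramDir });
            ramDir = new RAMDirectory();          // start a fresh RAM batch
            ramWriter = new IndexWriter(ramDir, analyzer, true);
            inBatch = 0;
        }
    }
    ramWriter.close();
    fsWriter.addIndexes(new Directory[] { ramDir });  // final partial batch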

So, I'm posting my code and analysis, and the hprof trace from it in an 
attempt to have as many people look at my code, laugh, and point out my 
failings.  Please be gentle with me... :)

First, here's my indexing testbed code (please ignore the RAMWriter; I 
know it's not used in this example, but I was trialling things at times):

    private static void createIndex(final Logger LOG, File indexDir)
            throws IOException, SQLException {
        indexDir.mkdirs();
        LOG.info("Index=" + indexDir.getAbsolutePath());
        Analyzer a = new StandardAnalyzer();
        IndexWriter writer = new IndexWriter(indexDir, a, true);
        writer.setMergeFactor(1000);
        writer.setMaxBufferedDocs(10000);
        writer.setUseCompoundFile(false);
        //writer.setInfoStream(System.out);
        Connection connection = DriverManager.getConnection(
                "jdbc:inetdae7:tower.aconex.com?database=mel20050407l",
                "*****", "******");
        String sql = "select userid, userfirstname, userlastname, email from userx";
        LOG.info("sql=" + sql);
        Statement statement = connection.createStatement();
        statement.setFetchSize(5000);
        LOG.info("Executing sql");
        ResultSet rs = statement.executeQuery(sql);
        LOG.info("ResultSet retrieved");
        int row = 0;
        //final int MAX_ROWS = 100000;
        //SimpleDateFormat df = new SimpleDateFormat("yyyyMMdd");
        // Left over from earlier RAM-batching experiments; not used below.
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, a, true);
        //writer.setUseCompoundFile(false);
        //ramWriter.setMaxBufferedDocs(10000);
        //ramWriter.setMergeFactor(1000);
        LOG.info("Indexing users");
        long begin = System.currentTimeMillis();
        while (rs.next()) {
            int userid = rs.getInt(1);
            String firstname = rs.getString(2);
            String lastname = rs.getString(3);
            String email = rs.getString(4);
            String fullName = firstname + " " + lastname;
            Document doc = new Document();
            doc.add(Field.Keyword("userid", userid + ""));
            doc.add(Field.Keyword("firstname", firstname.toLowerCase()));
            doc.add(Field.Keyword("lastname", lastname.toLowerCase()));
            doc.add(Field.Text("name", fullName.toLowerCase()));
            doc.add(Field.Keyword("email", email.toLowerCase()));
            writer.addDocument(doc);
            row++;
            if ((row % 100) == 0) {
                LOG.info(row + " indexed");
            }
        }
        long end = System.currentTimeMillis();
        double diff = (end - begin) / 1000.0;
        double rate = row / diff;
        LOG.info("rate:" + rate);
        rs.close();
        statement.close();
        connection.close();
        ramWriter.close();
        LOG.info("Finished RAM index, creating file index, " + row + " users indexed");
        //writer.addIndexes(new Directory[]{ramDir});
        LOG.info("Optimizing");
        writer.optimize();
        writer.close();
        LOG.info("Completed indexing");
    }

Even switching between compound and non-compound file format doesn't 
really change the fact that both my Mac PowerBook and a dual Xeon 2.4 
GHz box, each with at least a GB of RAM, are effectively CPU maxed out 
(only one CPU on the dual Xeon is maxed out, obviously).

The attached hprof logs, from runs using the 
"-Xrunhprof:cpu=samples,file=log.txt,depth=3" option, highlight that:

  1 48.94% 48.94%    1061   101 org.apache.lucene.index.IndexWriter.maybeMergeSegments
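
(For reference, that's the stock JDK hprof agent enabled on the java 
command line; the main class name below is just a placeholder:)

    java -Xrunhprof:cpu=samples,file=log.txt,depth=3 MyIndexingTestbed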

This reads to me like it's spending a LOT of time in this method.  
Looking at the current code in SVN, and debugging through it, I see that 
as each Document is added to the IndexWriter, the loop in 
maybeMergeSegments gets called, and it spends a busy time looping 
through each SegmentInfo instance.  In my case, it seems that there is 
only one document in each SegmentInfo.  This means that each time a new 
document is added, this loop gets bigger and bigger.  Perhaps someone 
can explain why each SegmentInfo only contains 1 doc during this time?
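
For reference, here is roughly what that loop looks like (paraphrased 
from the 1.4-era source, so the current trunk may differ in detail):

    // Sketch of IndexWriter.maybeMergeSegments(), called on every
    // addDocument().  Not the exact SVN source.
    long targetMergeDocs = minMergeDocs;
    while (targetMergeDocs <= maxMergeDocs) {
        // walk backwards over *every* SegmentInfo, totalling the doc
        // counts of segments smaller than the current target size
        int minSegment = segmentInfos.size();
        int mergeDocs = 0;
        while (--minSegment >= 0) {
            SegmentInfo si = segmentInfos.info(minSegment);
            if (si.docCount >= targetMergeDocs)
                break;
            mergeDocs += si.docCount;
        }
        if (mergeDocs >= targetMergeDocs)
            mergeSegments(minSegment + 1);   // enough small segments: merge
        else
            break;
        targetMergeDocs *= mergeFactor;      // look for the next merge level
    }
    // With single-document segments, the inner while loop visits every
    // buffered segment on every add, which is where my CPU seems to go.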

Now I'm sure there is a good reason for this loop, but I'm just 
highlighting that this seems to be the biggest CPU consumer of the 
indexing operation as I see it.  (again, it could just be my test setup).

See also log.log, which is the log4j output as each 100 items are added 
(ignore the "Finished RAM index..." log message; as you can see from the 
code above, the RAMWriter isn't actually used in this case).

Please, observations about why my testing setup is not accurate are very 
welcome.  I just want to understand why I'm CPU bound.  At this rate I'm 
only getting, on average, 300-400-ish items/second added to the index.  
Of course a faster CPU is always a good idea, but if there _is_ a way to 
optimize this area, one could see a significant indexing speed increase 
(and move the bottleneck back to IO, I guess).

cheers,

Paul Smith


