lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Benchmarkers
Date Mon, 03 Apr 2006 17:36:53 GMT

Marvin Humphrey wrote:
>     IndexWriter writer = new IndexWriter(indexDir,
>       new WhitespaceAnalyzer(), true);

Please make sure that analyzers are comparable between the various 
engines you benchmark.  WhitespaceAnalyzer is efficient, but results in 
far more tokens and terms than, e.g., StopAnalyzer (alphabetic character 
sequences, lowercased, with a 35-word English stop list).  Since tokens 
and terms are the atoms and elements of indexing, their counts are a 
dominant factor in performance.
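
To make the token/term difference concrete, here is a small plain-JDK
sketch.  It does not use Lucene: the two methods only approximate what
WhitespaceAnalyzer and StopAnalyzer do, and STOP_WORDS is a short
illustrative subset, not Lucene's actual stop list.  The class and
method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TokenCountSketch {
    // Illustrative subset only; NOT Lucene's real English stop list.
    static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("a", "an", "and", "is", "of", "on", "the", "to"));

    // Approximates WhitespaceAnalyzer: split on whitespace, keep case
    // and punctuation exactly as they appear.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Approximates StopAnalyzer: runs of letters, lowercased, with
    // stop words dropped.
    static List<String> stopTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            String lower = t.toLowerCase();
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    // Number of distinct terms among the tokens.
    static int distinct(List<String> tokens) {
        return new HashSet<>(tokens).size();
    }

    public static void main(String[] args) {
        String text =
            "The Cat sat. The cat is on the mat, and the Mat is flat.";
        List<String> ws = whitespaceTokens(text);
        List<String> st = stopTokens(text);
        System.out.println("whitespace: " + ws.size() + " tokens, "
            + distinct(ws) + " terms");
        System.out.println("stop-style: " + st.size() + " tokens, "
            + distinct(st) + " terms");
    }
}
```

On this sample sentence the whitespace split yields 14 tokens and 11
distinct terms, while the stop-style split yields 6 tokens and 4 terms.
Real corpora will differ, but the direction of the gap is the point.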

Raising IndexWriter.setMaxBufferedDocs() to 100 or more will increase 
indexing speed at the cost of more Java heap.  The default is 10.  
IndexWriter.setUseCompoundFile(false) will also increase indexing speed.  
I don't think increasing IndexWriter.setMergeFactor() should help much, 
and I advise staying with the default (10).  Folks used to set this as a 
surrogate for setMaxBufferedDocs before that was a separate parameter.

You may need to specify a larger Java heap, with something like 
-Xmx500M.  The default is around 64MB.  Also, the -server option is 
almost always faster with Sun's JVM.  Sun's 1.5 JVM is faster than their 
1.4 JVM.  I think IBM's JVM may be generally faster for indexing.  The 
last I checked, one was faster for indexing and the other for 
searching, but I'm not certain which was which.
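
One way to confirm that a heap setting actually took effect is to have
the benchmark print what the JVM reports before it starts indexing.  A
minimal sketch using only the JDK (the class name HeapCheck is made up
for this example):

```java
class HeapCheck {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: "
            + maxHeapMegabytes() + " MB");
    }
}
```

Run it with the same flags as the benchmark, for example
"java -server -Xmx500M HeapCheck".  The reported figure can be somewhat
below the -Xmx value, depending on the JVM and garbage collector.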

>   private Document nextDoc(File f) throws Exception {
>     // the title is the first line, the body is the rest
>     BufferedReader br = new BufferedReader(new FileReader(f));
>     String title;
>     if ( (title = br.readLine()) == null)
>       throw new Exception("Failed to read title");
>     StringBuffer buf = new StringBuffer();
>     String str;
>     while ( (str = br.readLine()) != null )
>       buf.append( str );
>     br.close();
>     String body = buf.toString();
> 
>     // add title and body to doc
>     Document doc = new Document();
>     Field titleField = new Field("title", title,
>       Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
>     Field bodyField = new Field("body", body,
>       Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
>     doc.add(titleField);
>     doc.add(bodyField);

You can avoid some buffering by passing a Reader for the body text.  
Reader-valued fields are always tokenized and indexed but never stored, 
so the constructor takes no Store or Index arguments:

   Field bodyField = new Field("body", br);

The only rub is that you'll have to make sure that the FileReader is 
closed.  So you could rewrite this method to be something like:

   private void indexFile(File f, IndexWriter writer) throws IOException {
     BufferedReader br = new BufferedReader(new FileReader(f));
     try {
       Document doc = new Document();

       ... read title from br and add it to doc ...

       // Reader-valued fields are tokenized and indexed, never stored
       Field bodyField = new Field("body", br);
       doc.add(bodyField);

       writer.addDocument(doc);
     } finally {
       br.close();
     }
   }

Does that make sense?
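
The read-the-title-then-hand-over-the-reader pattern can be tried
without Lucene.  In the sketch below, BodyConsumer is a hypothetical
stand-in for whatever consumes the body (in the real method, the Field
plus writer.addDocument), and countChars exists only to show that the
rest of the stream arrives intact.  None of these names come from
Lucene.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class TitleThenBody {
    // Hypothetical stand-in for the code that consumes the body.
    interface BodyConsumer {
        void accept(String title, Reader body) throws IOException;
    }

    static void process(Reader source, BodyConsumer consumer)
            throws IOException {
        BufferedReader br = new BufferedReader(source);
        try {
            // The title is the first line.
            String title = br.readLine();
            if (title == null) {
                throw new IOException("Failed to read title");
            }
            // br is now positioned at the start of the body, so the
            // consumer can stream it without a StringBuffer copy.
            consumer.accept(title, br);
        } finally {
            // Closed only after the consumer has finished reading.
            br.close();
        }
    }

    // Drains the reader and returns how many characters it held.
    static int countChars(Reader r) throws IOException {
        char[] buf = new char[4096];
        int total = 0;
        int n;
        while ((n = r.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        String file = "My Title\nfirst body line\nsecond body line\n";
        process(new StringReader(file), (title, body) ->
            System.out.println(title + ": " + countChars(body)
                + " body chars"));
    }
}
```

The important detail is ordering: the reader must stay open until the
consumer has read it, which is why the close sits in the finally block
around the whole operation rather than right after the title is read.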

Finally, I question your use of Field.Store.YES.  Do you really need to 
use Lucene to store the full content of your documents?  Are you asking 
this of the other engines?

Cheers,

Doug


