lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Benchmarkers
Date Tue, 04 Apr 2006 00:02:14 GMT

On Apr 3, 2006, at 10:36 AM, Doug Cutting wrote:

> Marvin Humphrey wrote:
>>     IndexWriter writer = new IndexWriter(indexDir,
>>       new WhitespaceAnalyzer(), true);
>
> Please make sure that analyzers are comparable between the various  
> engines you benchmark.  WhitespaceAnalyzer is efficient, but  
> results in far more tokens and terms than, e.g., StopAnalyzer  
> (alphabetic character sequences, lowercased, with a 35-word English  
> stop list).

They're all using WhitespaceAnalyzer or the equivalent.  KinoSearch  
doesn't offer that class per se, but its Tokenizer class allows you  
to specify an arbitrary regex matching one token.

     # a WhitespaceAnalyzer in KinoSearch
     my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
         token_re => qr/\S+/,
     );

> Since tokens and terms are the atoms and elements of indexing,  
> their counts are a dominant factor in performance.
>
> Increasing IndexWriter.setMaxBufferedDocs to 100 or more will
> increase indexing speed by using more Java heap; 10 is the
> default.  IndexWriter.setUseCompoundFile(false) will also increase
> indexing speed.  I don't think increasing IndexWriter.setMergeFactor()
> should help much, and I advise staying with the default (10).
> Folks used to set this as a surrogate for setMaxBufferedDocs before
> that was a separate parameter.

I'm addressing these issues in my reply to Yonik.
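For anyone following along on the Lucene side, those knobs look roughly
like this (the values are just the ones Doug mentions, not what I've
settled on):

     IndexWriter writer = new IndexWriter(indexDir,
         new WhitespaceAnalyzer(), true);
     writer.setMaxBufferedDocs(100);    // default is 10
     writer.setUseCompoundFile(false);  // skip writing compound files
     // leaving writer.setMergeFactor() at its default of 10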
>
> You may need to specify a larger Java heap, with something like
> -Xmx500M.  The default is around 64MB.

Great, I'll use -Xmx500M.
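Presumably the invocation ends up looking something like this (the class
name and corpus path are just placeholders):

     java -Xmx500M LuceneIndexer /path/to/corpus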

> Also, the -server option is almost always faster with Sun's JVM.
> Sun's 1.5 JVM is faster than their 1.4 JVM.  I think IBM's JVM may
> be generally faster for indexing.  The last I checked, one was
> faster for indexing and the other for searching, but I'm not
> certain which was which.

I'm running these on my G4 laptop.

>
>>   private Document nextDoc(File f) throws Exception {
>>     // the title is the first line, the body is the rest
>>     BufferedReader br = new BufferedReader(new FileReader(f));
>>     String title;
>>     if ( (title = br.readLine()) == null)
>>       throw new Exception("Failed to read title");
>>     StringBuffer buf = new StringBuffer();
>>     String str;
>>     while ( (str = br.readLine()) != null )
>>       buf.append( str );
>>     br.close();
>>     String body = buf.toString();
>>     // add title and body to doc
>>     Document doc = new Document();
>>     Field titleField = new Field("title", title,
>>       Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
>>     Field bodyField = new Field("body", body,
>>       Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
>>     doc.add(titleField);
>>     doc.add(bodyField);
>
> You can avoid some buffering by passing a Reader for the body text:
>
>   Field bodyField = new Field("body", br,
>     Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
>
> The only rub is that you'll have to make sure that the FileReader  
> is closed.  So you could rewrite this method to be something like:
>
>   private void indexFile(File f, IndexWriter writer) throws IOException {
>     BufferedReader br = new BufferedReader(new FileReader(f));
>     try {
>       Document doc = new Document();
>
>       ... read title from br and add it to doc ...
>
>       Field bodyField = new Field("body", br,
>         Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.NO);
>       doc.add(bodyField);
>
>       writer.addDocument(doc);
>     } finally {
>       br.close();
>     }
>   }
>
> Does that make sense?

It does.  However, there's no Field constructor that accepts a Reader
instead of a String and still allows Field.Store.YES.
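If I read the javadocs right, the Reader-based constructor gives you a
field that's tokenized and indexed but never stored, so the unstored run
would be something like this (worth double-checking against the current
Field API):

     // Reader-based field: indexed and tokenized, but not stored
     Field bodyField = new Field("body", br);
     doc.add(bodyField);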

> Finally, I question your use of Field.Store.YES.  Do you really need
> to use Lucene to store the full content of your documents?

Yes.  By default, all fields in KinoSearch are analyzed, stored, and  
vectorized (with positions and offsets).  This allows use of the  
Highlighter with minimum fuss.  Savvy users looking to shrink the  
size of their indexes can override those defaults.

I'd originally omitted TermVectors from the benchmarking apps because  
Plucene doesn't have them.  But having KinoSearch and Lucene generate  
them isn't going to slow them down enough that Plucene will become  
competitive.  It makes sense to generate two result sets, one with  
the body stored and vectored, and one with the body neither stored  
nor vectored.  I'll use the Reader constructor for Lucene's unstored  
version.
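Concretely, the stored-and-vectored Lucene config would be roughly this
(assuming WITH_POSITIONS_OFFSETS is the right TermVector flavor to mirror
KinoSearch's positions-and-offsets default), with the unstored run using
the Reader constructor noted above:

     // stored and vectored, with positions and offsets
     Field bodyField = new Field("body", body,
         Field.Store.YES, Field.Index.TOKENIZED,
         Field.TermVector.WITH_POSITIONS_OFFSETS);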

>   Are you asking this of the other engines?

They were all on an even footing before.  Now Plucene will have a slight
advantage in one config.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

