lucene-java-user mailing list archives

From "Jochen Frey" <>
Subject Benchmark (WAS: Indexing Speed: Documents vs. Sentences)
Date Fri, 19 Dec 2003 17:00:23 GMT

	Here is a benchmark. I am not sure whether this is proper etiquette,
but I will just paste it into this mail and hope that it gets funneled into
the right channels.


  <b>Hardware Environment</b><br/>
  <li><i>Dedicated machine for indexing</i>no, some other work is performed
on it. This shouldn't influence results much since it's a multiprocessor
machine.</li>
  <li><i>CPU</i>2x Intel Xeon 3.05GHz</li>
  <li><i>Drive configuration</i>SCSI</li>
  <b>Software environment</b><br/>
  <li><i>Java Version</i>1.4.2-b28</li>
  <li><i>Java VM</i>Java HotSpot Client VM 1.4.2</li>
  <li><i>OS Version</i>Redhat 8</li>
  <li><i>Location of index</i>local</li>
  <b>Lucene indexing variables</b><br/>
  <li><i>Number of source documents</i>5,000,000</li>
  <li><i>Total filesize of source documents</i>40GB</li>
  <li><i>Average filesize of source documents</i>8kB</li>
  <li><i>Source documents storage location</i>DB on remote server</li>
  <li><i>File type of source documents</i>pre-parsed HTML</li>
  <li><i>Parser(s) used, if any</i>n/a</li>
  <li><i>Analyzer(s) used</i>StandardAnalyzer</li>
  <li><i>Number of fields per document</i>5</li>
  <li><i>Type of fields</i>actual text is indexed but not stored in Lucene</li>
  <li><i>Index persistence</i>: Where the index is stored, e.g. 
FSDirectory, SqlDirectory, etc</li>
  <li><i>Time taken (in ms/s as an average of at least 3 indexing 
runs)</i>332 minutes</li>
  <li><i>Time taken / 1000 docs indexed</i>4 sec</li>
  <li><i>Memory consumption</i>about 100MB</li>
  <li><i>Notes</i>With the above configuration we pretty consistently
achieve an indexing rate of 250 docs / sec. The actual text cannot be
retrieved from the index; this keeps the index size down (6.1GB) and
increases indexing speed. When the actual documents are stored in the
index, the rate drops by about 30% to 160 docs / sec.</li>
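
The figures above can be cross-checked with a little arithmetic. The following is only a sketch: the class and method names (BenchmarkRates, docsPerSecond, secondsPerThousand, relativeDrop) are made up for illustration and are not part of Lucene or of the benchmark itself. It uses only the numbers quoted in the list (5,000,000 documents, 332 minutes, 250 and 160 docs / sec).

```java
public class BenchmarkRates {
    // Documents indexed per second, given a document count and elapsed minutes.
    static double docsPerSecond(long docs, double minutes) {
        return docs / (minutes * 60.0);
    }

    // Seconds needed to index 1000 documents at the given rate.
    static double secondsPerThousand(double docsPerSecond) {
        return 1000.0 / docsPerSecond;
    }

    // Relative drop from one rate to another, as a fraction of the first rate.
    static double relativeDrop(double before, double after) {
        return (before - after) / before;
    }

    public static void main(String[] args) {
        // Values taken from the benchmark list above.
        double rate = docsPerSecond(5000000L, 332.0);
        System.out.println("docs/sec:          " + rate);
        System.out.println("sec per 1000 docs: " + secondsPerThousand(rate));
        System.out.println("drop 250 -> 160:   " + relativeDrop(250.0, 160.0));
    }
}
```

With the reported totals this works out to roughly 251 docs / sec, or just under 4 seconds per 1000 documents, which matches the stated rate. Taking the two quoted rates literally, the drop from 250 to 160 docs / sec is 36%, somewhat more than the "about 30%" given in the notes.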
