lucene-dev mailing list archives

From o...@apache.org
Subject cvs commit: jakarta-lucene/docs benchmarks.html benchmarktemplate.xml
Date Wed, 04 Dec 2002 05:46:45 GMT
otis        2002/12/03 21:46:44

  Added:       xdocs    benchmarks.xml
               docs     benchmarks.html benchmarktemplate.xml
  Log:
  - User-submitted benchmarks and a template.
  
  Submitted by:	Kelvin Tan
  Reviewed by:	 otis
  
  Revision  Changes    Path
  1.1                  jakarta-lucene/xdocs/benchmarks.xml
  
  Index: benchmarks.xml
  ===================================================================
  <?xml version="1.0"?>
  <document>
      <properties>
        <author email="kelvint@apache.org">Kelvin Tan</author>
        <title>Resources - Performance Benchmarks</title>
      </properties>
      <body>
  
        <section name="Performance Benchmarks">
        <p>
        The purpose of these user-submitted performance figures is to 
  give current and potential users of Lucene a sense 
        of how well Lucene scales. If the requirements for an upcoming 
  project are similar to an existing benchmark, you 
        will also have something to work with when designing the system 
  architecture for the application.
        </p>
        <p>
        If you've conducted performance tests with Lucene, we'd 
  appreciate it if you could submit these figures for display 
        on this page. Post these figures to the lucene-user mailing list 
  using this 
        <a href="benchmarktemplate.xml">template</a>.
        </p>
        </section>
        
        <section name="Benchmark Variables">
        <p>
        <ul>
        <p>
        <b>Hardware Environment</b><br/>
        <li><i>Dedicated machine for indexing</i>: Self-explanatory 
  (yes/no)</li>
        <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
        <li><i>RAM</i>: Self-explanatory</li>
        <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI, 
  RAID-1, RAID-5)</li>
        </p>
        <p>
        <b>Software environment</b><br/>
        <li><i>Java Version</i>: Version of Java SDK/JRE that is run 
  </li>
        <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
        <li><i>OS Version</i>: Self-explanatory</li>
        <li><i>Location of index</i>: Is the index stored on a filesystem
  or in a database? Is it on the same server (local) or 
        over the network?</li>
        </p>
        <p>
        <b>Lucene indexing variables</b><br/>
        <li><i>Number of source documents</i>: Number of documents being

  indexed</li>
        <li><i>Total filesize of source documents</i>: 
  Self-explanatory</li>
        <li><i>Average filesize of source documents</i>: 
  Self-explanatory</li>
        <li><i>Source documents storage location</i>: Where are the 
  documents being indexed located? 
          Filesystem, DB, HTTP, etc.</li>
        <li><i>File type of source documents</i>: Types of files being 
  indexed, e.g. HTML files, XML files, PDF files, etc.</li>
        <li><i>Parser(s) used, if any</i>: Parsers used for parsing the

  various files for indexing, 
          e.g. XML parser, HTML parser, etc.</li>
        <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
        <li><i>Number of fields per document</i>: Number of Fields each

  Document contains</li>
        <li><i>Type of fields</i>: Type of each field</li>
        <li><i>Index persistence</i>: Where the index is stored, e.g. 
  FSDirectory, SqlDirectory, etc</li>
        </p>
        <p>
        <b>Figures</b><br/>
        <li><i>Time taken (in ms/s as an average of at least 3 indexing 
  runs)</i>: Time taken to index all files</li>
        <li><i>Time taken / 1000 docs indexed</i>: Time taken to index 
  1000 files</li>
        <li><i>Memory consumption</i>: Self-explanatory</li>
        </p>
        <p>
        <b>Notes</b><br/>
        <li><i>Notes</i>: Any comments which don't belong in the above,

  special tuning/strategies, etc</li>
        </p>
        </ul>
        </p>
        </section>
  
        <section name="User-submitted Benchmarks">
        <p>
        These benchmarks have been kindly submitted by Lucene users for 
  reference purposes. 
        </p>
        <p><b>We make NO guarantees regarding their accuracy or 
  validity.</b>
        </p>
        <p>We strongly recommend you conduct your own 
        performance benchmarks before deciding on a particular 
  hardware/software setup (and hopefully submit 
        these figures to us).
        </p>
        
          <subsection name="Hamish Carpenter's benchmarks">
            <ul>
            <p>
            <b>Hardware Environment</b><br/>
            <li><i>Dedicated machine for indexing</i>: yes</li>
            <li><i>CPU</i>: Intel x86 P4 1.5GHz</li>
            <li><i>RAM</i>: 512 DDR</li>
            <li><i>Drive configuration</i>: IDE 7200rpm RAID-1</li>
            </p>
            <p>
            <b>Software environment</b><br/>
            <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
            <li><i>Java VM</i>: </li>
            <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
            <li><i>Location of index</i>: local</li>
            </p>
            <p>
            <b>Lucene indexing variables</b><br/>
            <li><i>Number of source documents</i>: Random generator. Set

  to make 1M documents
  in 2x500,000 batches.</li>
            <li><i>Total filesize of source documents</i>: &gt; 1GB if 
  stored</li>
            <li><i>Average filesize of source documents</i>: 1KB</li>
            <li><i>Source documents storage location</i>: Filesystem</li>
            <li><i>File type of source documents</i>: Generated</li>
            <li><i>Parser(s) used, if any</i>: </li>
            <li><i>Analyzer(s) used</i>: Default</li>
            <li><i>Number of fields per document</i>: 11</li>
            <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
            <li><i>Index persistence</i>: FSDirectory</li>
            </p>
            <p>
            <b>Figures</b><br/>
            <li><i>Time taken (in ms/s as an average of at least 3 
  indexing runs)</i>: </li>
            <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
            <li><i>Memory consumption</i>:</li>
            </p>
            <p>
            <b>Notes</b><br/>
            <li><i>Notes</i>: 
            <p>
            A Windows client ran a random document generator which 
  created
            documents based on some arrays of values and an excerpt 
  (approx. 1KB)
            from a text file of the Bible (King James Version).<br/>
            These were submitted via a socket connection (open throughout
            the indexing process).<br/>
            The index writer was not closed between index calls.<br/>
            This created a 400MB index in 23 files (after 
  optimization).<br/>
            </p>
            <p>
            <u>Query details</u>:<br/>
            </p>
            <p>
            Set up a threaded class to start x number of simultaneous 
  threads to
            search the above created index.
            </p>
            <p>
            Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
            (Teaser:goo* Teaser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
            +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
            </p>
            <p>
            This query matched 34,000 documents, and I limited the returned 
  documents
            to 5.
            </p>
            <p>
            This uses Peter Halacsy's IndexSearcherCache, slightly 
  modified to
            be a singleton that returns cached searchers for a given 
  directory. This
            solved an initial problem with too many open files and 
  running out of
            Linux file handles for them.
            </p>
            <pre>
            Threads|Avg Time per query (ms)
            1       1009ms
            2       2043ms
            3       3087ms
            4       4045ms
            ..        .
            ..        .
            10      10091ms
            </pre>
            <p>
            I removed the two date range terms from the query and it made 
  a HUGE
            difference in performance. With 4 threads the avg time 
  dropped to 900ms!
            </p>
            <p>Other query optimizations made little difference.</p></li>
            </p>
            </ul>
            <p>
            Hamish can be contacted at hamish at catalyst.net.nz.
            </p>
          </subsection>     
  
          <subsection name="Justin Greene's benchmarks">
            <ul>
            <p>
            <b>Hardware Environment</b><br/>
            <li><i>Dedicated machine for indexing</i>: No, but nominal 
  usage at time of indexing.</li>
            <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
            <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
            <li><i>Drive configuration</i>: RAID 5 on Fibre Channel 
  Array</li>
            </p>
            <p>
            <b>Software environment</b><br/>
            <li><i>Java Version</i>: 1.3.1_06</li>
            <li><i>Java VM</i>: </li>
            <li><i>OS Version</i>: Winnt 4/Sp6</li>
            <li><i>Location of index</i>: local</li>
            </p>
            <p>
            <b>Lucene indexing variables</b><br/>
            <li><i>Number of source documents</i>: about 60K</li>
            <li><i>Total filesize of source documents</i>: 6.5GB</li>
            <li><i>Average filesize of source documents</i>: 100K 
  (6.5GB/60K documents)</li>
            <li><i>Source documents storage location</i>: filesystem on

  NTFS</li>
            <li><i>File type of source documents</i>: </li>
            <li><i>Parser(s) used, if any</i>: Currently the only parser

  used is the Quiotix html
            parser.</li>
            <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
            <li><i>Number of fields per document</i>: 8</li>
            <li><i>Type of fields</i>: All strings, and all are stored 
  and indexed.</li>
            <li><i>Index persistence</i>: FSDirectory</li>
            </p>
            <p>
            <b>Figures</b><br/>
            <li><i>Time taken (in ms/s as an average of at least 3 
  indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 
  minutes.  Note that the number
            and size of documents change daily.</li>
            <li><i>Time taken / 1000 docs indexed</i>: </li>
            <li><i>Memory consumption</i>: JVM is given 256MB and uses it

  all.</li>
            </p>
            <p>
            <b>Notes</b><br/>
            <li><i>Notes</i>: 
            <p>
            We have 10 threads reading files from the filesystem, 
  parsing and
            analyzing them, and then pushing them onto a queue, and a single 
  thread popping
            them from the queue and indexing.  Note that we are indexing 
  email messages
            and are storing the entire plaintext of the message in the 
  index.  If the
            message contains an attachment and we do not have a filter for 
  the attachment
            (i.e. we do not do PDFs yet), we discard the data.
            </p></li>
            </p>
            </ul>
            <p>
            Justin can be contacted at tvxh-lw4x at spamex.com.
            </p>
          </subsection> 
  
        </section>
  
      </body>
  </document>
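
Justin Greene's notes above describe ten reader threads that parse files and push the results onto a queue, with a single thread popping from the queue and indexing. A minimal Java sketch of that arrangement follows. It is not Justin's code and it uses no Lucene classes: IndexingPipeline, the parse and index callbacks, and the queue capacity of 1000 are assumptions made for illustration, and java.util.concurrent postdates the JDK 1.3.1 used in the benchmark.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical sketch: many parsing threads feed one indexing thread. */
final class IndexingPipeline {

    /** Parses every path on a pool of reader threads and indexes the results on one thread. */
    static int run(List<String> paths, int readerThreads,
                   Function<String, String> parse, Consumer<String> index) {
        // Bounded queue (capacity is an assumption) so readers cannot outrun the indexer.
        BlockingQueue<Optional<String>> queue = new LinkedBlockingQueue<>(1000);
        int[] indexed = {0};

        // Single consumer: takes parsed documents until it sees the empty end marker.
        Thread indexer = new Thread(() -> {
            try {
                for (Optional<String> doc = queue.take(); doc.isPresent(); doc = queue.take()) {
                    index.accept(doc.get());
                    indexed[0]++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        indexer.start();

        // Producers: each task parses one source file and enqueues the result.
        ExecutorService readers = Executors.newFixedThreadPool(readerThreads);
        for (String path : paths) {
            readers.execute(() -> {
                try {
                    queue.put(Optional.of(parse.apply(path)));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            readers.shutdown();
            readers.awaitTermination(1, TimeUnit.HOURS);
            queue.put(Optional.empty()); // end marker, sent after all readers have finished
            indexer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return indexed[0];
    }

    public static void main(String[] args) {
        int count = run(List.of("a.html", "b.html", "c.html"), 10,
                path -> "parsed:" + path,
                doc -> System.out.println("indexing " + doc));
        System.out.println(count + " documents indexed");
    }
}
```

In a real setup the index callback would hand the parsed document to the index writer. Keeping all writes on one thread means the writer is never shared between threads.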
  
  
  
  
  1.1                  jakarta-lucene/docs/benchmarks.html
  
  Index: benchmarks.html
  ===================================================================
  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
  
  <!-- Content Stylesheet for Site -->
  
          
  <!-- start the processing -->
      <!-- ====================================================================== -->
      <!-- Main Page Section -->
      <!-- ====================================================================== -->
      <html>
          <head>
              <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
  
                                                      <meta name="author" value="Kelvin Tan">
              <meta name="email" value="kelvint@apache.org">
              
             
                                      
              <title>Jakarta Lucene - Resources - Performance Benchmarks</title>
          </head>
  
          <body bgcolor="#ffffff" text="#000000" link="#525D76">        
              <table border="0" width="100%" cellspacing="0">
                  <!-- TOP IMAGE -->
                  <tr>
                      <td align="left">
  <a href="http://jakarta.apache.org"><img src="http://jakarta.apache.org/images/jakarta-logo.gif"
border="0"/></a>
  </td>
  <td align="right">
  <a href="http://jakarta.apache.org/lucene/"><img src="./images/lucene_green_300.gif"
alt="Jakarta Lucene" border="0"/></a>
  </td>
                  </tr>
              </table>
              <table border="0" width="100%" cellspacing="4">
                  <tr><td colspan="2">
                      <hr noshade="" size="1"/>
                  </td></tr>
                  
                  <tr>
                      <!-- LEFT SIDE NAVIGATION -->
                      <td width="20%" valign="top" nowrap="true">
                                  <p><strong>About</strong></p>
          <ul>
                      <li>    <a href="./index.html">Overview</a>
  </li>
                      <li>    <a href="./powered.html">Powered by Lucene</a>
  </li>
                      <li>    <a href="./whoweare.html">Who We Are</a>
  </li>
                      <li>    <a href="http://jakarta.apache.org/site/mail.html">Mailing
Lists</a>
  </li>
                  </ul>
              <p><strong>Resources</strong></p>
          <ul>
                      <li>    <a href="http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi">FAQ
(Official)</a>
  </li>
                      <li>    <a href="http://www.jguru.com/faq/Lucene">jGuru
FAQ</a>
  </li>
                      <li>    <a href="./gettingstarted.html">Getting Started</a>
  </li>
                      <li>    <a href="./queryparsersyntax.html">Query Syntax</a>
  </li>
                      <li>    <a href="./fileformats.html">File Formats</a>
  </li>
                      <li>    <a href="./api/index.html">Javadoc</a>
  </li>
                      <li>    <a href="./contributions.html">Contributions</a>
  </li>
                      <li>    <a href="./lucene-sandbox/">Lucene Sandbox</a>
  </li>
                      <li>    <a href="./resources.html">Articles, etc.</a>
  </li>
                      <li>    <a href="./todo.html">TODO list</a>
  </li>
                      <li>    <a href="http://nagoya.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=%5BPATCH%5D&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=%27Importance%27/todo.html">Patches</a>
  </li>
                      <li>    <a href="http://jakarta.apache.org/site/bugs.html">Bugs</a>
  </li>
                      <li>    <a href="http://nagoya.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=%27Importance%27">Lucene
Bugs</a>
  </li>
                      <li>    <a href="http://nagoya.apache.org/eyebrowse/SummarizeList?listId=30">Lucene-user</a>
  </li>
                      <li>    <a href="http://nagoya.apache.org/eyebrowse/SummarizeList?listId=29">Lucene-dev</a>
  </li>
                  </ul>
              <p><strong>Plans</strong></p>
          <ul>
                      <li>    <a href="./luceneplan.html">Application Extensions</a>
  </li>
                  </ul>
              <p><strong>Download</strong></p>
          <ul>
                      <li>    <a href="http://jakarta.apache.org/site/binindex.html">Binaries</a>
  </li>
                      <li>    <a href="http://jakarta.apache.org/site/sourceindex.html">Source
Code</a>
  </li>
                      <li>    <a href="http://jakarta.apache.org/site/cvsindex.html">CVS
Repositories</a>
  </li>
                  </ul>
              <p><strong>Jakarta</strong></p>
          <ul>
                      <li>    <a href="http://jakarta.apache.org/site/getinvolved.html">Get
Involved</a>
  </li>
                      <li>    <a href="http://jakarta.apache.org/site/acknowledgements.html">Acknowledgements</a>
  </li>
                      <li>    <a href="http://jakarta.apache.org/site/contact.html">Contact</a>
  </li>
                      <li>    <a href="http://jakarta.apache.org/site/legal.html">Legal</a>
  </li>
                  </ul>
                          </td>
                      <td width="80%" align="left" valign="top">
                                                                      <table border="0"
cellspacing="0" cellpadding="2" width="100%">
        <tr><td bgcolor="#525D76">
          <font color="#ffffff" face="arial,helvetica,sanserif">
            <a name="Performance Benchmarks"><strong>Performance Benchmarks</strong></a>
          </font>
        </td></tr>
        <tr><td>
          <blockquote>
                                      <p>
        The purpose of these user-submitted performance figures is to 
  give current and potential users of Lucene a sense 
        of how well Lucene scales. If the requirements for an upcoming 
  project are similar to an existing benchmark, you 
        will also have something to work with when designing the system 
  architecture for the application.
        </p>
                                                  <p>
        If you've conducted performance tests with Lucene, we'd 
  appreciate it if you could submit these figures for display 
        on this page. Post these figures to the lucene-user mailing list 
  using this 
        <a href="benchmarktemplate.xml">template</a>.
        </p>
                              </blockquote>
          </p>
        </td></tr>
        <tr><td><br/></td></tr>
      </table>
                                                  <table border="0" cellspacing="0" cellpadding="2"
width="100%">
        <tr><td bgcolor="#525D76">
          <font color="#ffffff" face="arial,helvetica,sanserif">
            <a name="Benchmark Variables"><strong>Benchmark Variables</strong></a>
          </font>
        </td></tr>
        <tr><td>
          <blockquote>
                                      <p>
        <ul>
        <p>
        <b>Hardware Environment</b><br />
        <li><i>Dedicated machine for indexing</i>: Self-explanatory 
  (yes/no)</li>
        <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
        <li><i>RAM</i>: Self-explanatory</li>
        <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI, 
  RAID-1, RAID-5)</li>
        </p>
        <p>
        <b>Software environment</b><br />
        <li><i>Java Version</i>: Version of Java SDK/JRE that is run 
  </li>
        <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
        <li><i>OS Version</i>: Self-explanatory</li>
        <li><i>Location of index</i>: Is the index stored on a filesystem
  or in a database? Is it on the same server (local) or 
        over the network?</li>
        </p>
        <p>
        <b>Lucene indexing variables</b><br />
        <li><i>Number of source documents</i>: Number of documents being

  indexed</li>
        <li><i>Total filesize of source documents</i>: 
  Self-explanatory</li>
        <li><i>Average filesize of source documents</i>: 
  Self-explanatory</li>
        <li><i>Source documents storage location</i>: Where are the 
  documents being indexed located? 
          Filesystem, DB, HTTP, etc.</li>
        <li><i>File type of source documents</i>: Types of files being 
  indexed, e.g. HTML files, XML files, PDF files, etc.</li>
        <li><i>Parser(s) used, if any</i>: Parsers used for parsing the

  various files for indexing, 
          e.g. XML parser, HTML parser, etc.</li>
        <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
        <li><i>Number of fields per document</i>: Number of Fields each

  Document contains</li>
        <li><i>Type of fields</i>: Type of each field</li>
        <li><i>Index persistence</i>: Where the index is stored, e.g. 
  FSDirectory, SqlDirectory, etc</li>
        </p>
        <p>
        <b>Figures</b><br />
        <li><i>Time taken (in ms/s as an average of at least 3 indexing 
  runs)</i>: Time taken to index all files</li>
        <li><i>Time taken / 1000 docs indexed</i>: Time taken to index 
  1000 files</li>
        <li><i>Memory consumption</i>: Self-explanatory</li>
        </p>
        <p>
        <b>Notes</b><br />
        <li><i>Notes</i>: Any comments which don't belong in the above,

  special tuning/strategies, etc</li>
        </p>
        </ul>
        </p>
                              </blockquote>
          </p>
        </td></tr>
        <tr><td><br/></td></tr>
      </table>
                                                  <table border="0" cellspacing="0" cellpadding="2"
width="100%">
        <tr><td bgcolor="#525D76">
          <font color="#ffffff" face="arial,helvetica,sanserif">
            <a name="User-submitted Benchmarks"><strong>User-submitted Benchmarks</strong></a>
          </font>
        </td></tr>
        <tr><td>
          <blockquote>
                                      <p>
        These benchmarks have been kindly submitted by Lucene users for 
  reference purposes. 
        </p>
                                                  <p><b>We make NO guarantees
regarding their accuracy or 
  validity.</b>
        </p>
                                                  <p>We strongly recommend you conduct
your own 
        performance benchmarks before deciding on a particular 
  hardware/software setup (and hopefully submit 
        these figures to us).
        </p>
                                                      <table border="0" cellspacing="0"
cellpadding="2" width="100%">
        <tr><td bgcolor="#828DA6">
          <font color="#ffffff" face="arial,helvetica,sanserif">
            <a name="Hamish Carpenter's benchmarks"><strong>Hamish Carpenter's
benchmarks</strong></a>
          </font>
        </td></tr>
        <tr><td>
          <blockquote>
                                      <ul>
            <p>
            <b>Hardware Environment</b><br />
            <li><i>Dedicated machine for indexing</i>: yes</li>
            <li><i>CPU</i>: Intel x86 P4 1.5GHz</li>
            <li><i>RAM</i>: 512 DDR</li>
            <li><i>Drive configuration</i>: IDE 7200rpm RAID-1</li>
            </p>
            <p>
            <b>Software environment</b><br />
            <li><i>Java Version</i>: 1.3.1 IBM JITC Enabled</li>
            <li><i>Java VM</i>: </li>
            <li><i>OS Version</i>: Debian Linux 2.4.18-686</li>
            <li><i>Location of index</i>: local</li>
            </p>
            <p>
            <b>Lucene indexing variables</b><br />
            <li><i>Number of source documents</i>: Random generator. Set

  to make 1M documents
  in 2x500,000 batches.</li>
            <li><i>Total filesize of source documents</i>: &gt; 1GB if 
  stored</li>
            <li><i>Average filesize of source documents</i>: 1KB</li>
            <li><i>Source documents storage location</i>: Filesystem</li>
            <li><i>File type of source documents</i>: Generated</li>
            <li><i>Parser(s) used, if any</i>: </li>
            <li><i>Analyzer(s) used</i>: Default</li>
            <li><i>Number of fields per document</i>: 11</li>
            <li><i>Type of fields</i>: 1 date, 1 id, 9 text</li>
            <li><i>Index persistence</i>: FSDirectory</li>
            </p>
            <p>
            <b>Figures</b><br />
            <li><i>Time taken (in ms/s as an average of at least 3 
  indexing runs)</i>: </li>
            <li><i>Time taken / 1000 docs indexed</i>: 49 seconds</li>
            <li><i>Memory consumption</i>:</li>
            </p>
            <p>
            <b>Notes</b><br />
            <li><i>Notes</i>: 
            <p>
            A Windows client ran a random document generator which 
  created
            documents based on some arrays of values and an excerpt 
  (approx. 1KB)
            from a text file of the Bible (King James Version).<br />
            These were submitted via a socket connection (open throughout
            the indexing process).<br />
            The index writer was not closed between index calls.<br />
            This created a 400MB index in 23 files (after 
  optimization).<br />
            </p>
            <p>
            <u>Query details</u>:<br />
            </p>
            <p>
            Set up a threaded class to start x number of simultaneous 
  threads to
            search the above created index.
            </p>
            <p>
            Query:  +Domain:sos +(+((Name:goo*^2.0 Name:plan*^2.0)
            (Teaser:goo* Teaser:plan*) (Details:goo* Details:plan*)) -Cancel:y)
            +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0] +EndDate:[mq3dj1uq0-ntlxuggw0]
            </p>
            <p>
            This query matched 34,000 documents, and I limited the returned 
  documents
            to 5.
            </p>
            <p>
            This uses Peter Halacsy's IndexSearcherCache, slightly 
  modified to
            be a singleton that returns cached searchers for a given 
  directory. This
            solved an initial problem with too many open files and 
  running out of
            Linux file handles for them.
            </p>
            <pre>
            Threads|Avg Time per query (ms)
            1       1009ms
            2       2043ms
            3       3087ms
            4       4045ms
            ..        .
            ..        .
            10      10091ms
            </pre>
            <p>
            I removed the two date range terms from the query and it made 
  a HUGE
            difference in performance. With 4 threads the avg time 
  dropped to 900ms!
            </p>
            <p>Other query optimizations made little difference.</p></li>
            </p>
            </ul>
                                                  <p>
            Hamish can be contacted at hamish at catalyst.net.nz.
            </p>
                              </blockquote>
        </td></tr>
        <tr><td><br/></td></tr>
      </table>
                                                      <table border="0" cellspacing="0"
cellpadding="2" width="100%">
        <tr><td bgcolor="#828DA6">
          <font color="#ffffff" face="arial,helvetica,sanserif">
            <a name="Justin Greene's benchmarks"><strong>Justin Greene's benchmarks</strong></a>
          </font>
        </td></tr>
        <tr><td>
          <blockquote>
                                      <ul>
            <p>
            <b>Hardware Environment</b><br />
            <li><i>Dedicated machine for indexing</i>: No, but nominal 
  usage at time of indexing.</li>
            <li><i>CPU</i>: Compaq Proliant 1850R/600 2 X pIII 600</li>
            <li><i>RAM</i>: 1GB, 256MB allocated to JVM.</li>
            <li><i>Drive configuration</i>: RAID 5 on Fibre Channel 
  Array</li>
            </p>
            <p>
            <b>Software environment</b><br />
            <li><i>Java Version</i>: 1.3.1_06</li>
            <li><i>Java VM</i>: </li>
            <li><i>OS Version</i>: Winnt 4/Sp6</li>
            <li><i>Location of index</i>: local</li>
            </p>
            <p>
            <b>Lucene indexing variables</b><br />
            <li><i>Number of source documents</i>: about 60K</li>
            <li><i>Total filesize of source documents</i>: 6.5GB</li>
            <li><i>Average filesize of source documents</i>: 100K 
  (6.5GB/60K documents)</li>
            <li><i>Source documents storage location</i>: filesystem on

  NTFS</li>
            <li><i>File type of source documents</i>: </li>
            <li><i>Parser(s) used, if any</i>: Currently the only parser

  used is the Quiotix html
            parser.</li>
            <li><i>Analyzer(s) used</i>: SimpleAnalyzer</li>
            <li><i>Number of fields per document</i>: 8</li>
            <li><i>Type of fields</i>: All strings, and all are stored 
  and indexed.</li>
            <li><i>Index persistence</i>: FSDirectory</li>
            </p>
            <p>
            <b>Figures</b><br />
            <li><i>Time taken (in ms/s as an average of at least 3 
  indexing runs)</i>: 1 hour 12 minutes, 1 hour 14 minutes and 1 hour 17 
  minutes.  Note that the number
            and size of documents change daily.</li>
            <li><i>Time taken / 1000 docs indexed</i>: </li>
            <li><i>Memory consumption</i>: JVM is given 256MB and uses it

  all.</li>
            </p>
            <p>
            <b>Notes</b><br />
            <li><i>Notes</i>: 
            <p>
            We have 10 threads reading files from the filesystem, 
  parsing and
            analyzing them, and then pushing them onto a queue, and a single 
  thread popping
            them from the queue and indexing.  Note that we are indexing 
  email messages
            and are storing the entire plaintext of the message in the 
  index.  If the
            message contains an attachment and we do not have a filter for 
  the attachment
            (i.e. we do not do PDFs yet), we discard the data.
            </p></li>
            </p>
            </ul>
                                                  <p>
            Justin can be contacted at tvxh-lw4x at spamex.com.
            </p>
                              </blockquote>
        </td></tr>
        <tr><td><br/></td></tr>
      </table>
                              </blockquote>
          </p>
        </td></tr>
        <tr><td><br/></td></tr>
      </table>
                                          </td>
                  </tr>
  
                  <!-- FOOTER -->
                  <tr><td colspan="2">
                      <hr noshade="" size="1"/>
                  </td></tr>
                  <tr><td colspan="2">
                      <div align="center"><font color="#525D76" size="-1"><em>
                      Copyright &#169; 1999-2002, Apache Software Foundation
                      </em></font></div>
                  </td></tr>
              </table>
          </body>
      </html>
  <!-- end the processing -->
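
Hamish Carpenter's query figures above come from "a threaded class to start x number of simultaneous threads" against the index. The original class is not part of this commit, so the following is only a sketch of how such a measurement could be taken. ConcurrentQueryTimer and the Runnable standing in for the search call are assumptions, and the sleeping fakeQuery in main is a placeholder, not a Lucene search.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical harness: runs the same query on N threads and averages the wall time. */
final class ConcurrentQueryTimer {

    /** Runs the query once on each of the given number of threads; returns mean time per query in ms. */
    static double averageMillis(int threads, Runnable query) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                tasks.add(() -> {
                    long start = System.nanoTime();
                    query.run();
                    return System.nanoTime() - start;
                });
            }
            long totalNanos = 0;
            // invokeAll submits every task and waits until all of them have finished.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                totalNanos += result.get();
            }
            return totalNanos / 1_000_000.0 / threads;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("timing run interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder workload; a real test would execute the search here.
        Runnable fakeQuery = () -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        System.out.println("Threads|Avg Time per query (ms)");
        for (int threads = 1; threads <= 4; threads++) {
            System.out.printf("%d\t%.0fms%n", threads, averageMillis(threads, fakeQuery));
        }
    }
}
```

The tasks are started by the pool as fast as it can schedule them, so the start is only approximately simultaneous. A single run per thread is also noisy, so repeating the measurement several times would give steadier figures.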
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  1.1                  jakarta-lucene/docs/benchmarktemplate.xml
  
  Index: benchmarktemplate.xml
  ===================================================================
  <benchmark>
    <ul>
    <p>
    <b>Hardware Environment</b><br/>
    <li><i>Dedicated machine for indexing</i>: Self-explanatory 
  (yes/no)</li>
    <li><i>CPU</i>: Self-explanatory (Type, Speed and Quantity)</li>
    <li><i>RAM</i>: Self-explanatory</li>
    <li><i>Drive configuration</i>: Self-explanatory (IDE, SCSI, RAID-1,

  RAID-5)</li>
    </p>
    <p>
    <b>Software environment</b><br/>
    <li><i>Java Version</i>: Version of Java SDK/JRE that is run </li>
    <li><i>Java VM</i>: Server/client VM, Sun VM/JRockIt</li>
    <li><i>OS Version</i>: Self-explanatory</li>
    <li><i>Location of index</i>: Is the index stored on a filesystem or 
  in a database? Is it on the same server (local) or 
    over the network?</li>
    </p>
    <p>
    <b>Lucene indexing variables</b><br/>
    <li><i>Number of source documents</i>: Number of documents being 
  indexed</li>
    <li><i>Total filesize of source documents</i>: Self-explanatory</li>
    <li><i>Average filesize of source documents</i>: 
  Self-explanatory</li>
    <li><i>Source documents storage location</i>: Where are the documents

  being indexed located? 
      Filesystem, DB, HTTP, etc.</li>
    <li><i>File type of source documents</i>: Types of files being 
  indexed, e.g. HTML files, XML files, PDF files, etc.</li>
    <li><i>Parser(s) used, if any</i>: Parsers used for parsing the 
  various files for indexing, 
      e.g. XML parser, HTML parser, etc.</li>
    <li><i>Analyzer(s) used</i>: Type of Lucene analyzer used</li>
    <li><i>Number of fields per document</i>: Number of Fields each 
  Document contains</li>
    <li><i>Type of fields</i>: Type of each field</li>
    <li><i>Index persistence</i>: Where the index is stored, e.g. 
  FSDirectory, SqlDirectory, etc</li>
    </p>
    <p>
    <b>Figures</b><br/>
    <li><i>Time taken (in ms/s as an average of at least 3 indexing 
  runs)</i>: Time taken to index all files</li>
    <li><i>Time taken / 1000 docs indexed</i>: Time taken to index 1000

  files</li>
    <li><i>Memory consumption</i>: Self-explanatory</li>
    </p>
    <p>
    <b>Notes</b><br/>
    <li><i>Notes</i>: Any comments which don't belong in the above, 
  special tuning/strategies, etc</li>
    </p>
    </ul>
  </benchmark>
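
The "Figures" section of the template asks for an average over at least 3 indexing runs and for the time taken per 1000 documents. As a worked example of that arithmetic, the sketch below uses the three run times and the "about 60K" document count quoted in Justin Greene's submission. BenchmarkFigures is a hypothetical helper, and because the document count is approximate the per-1000 result is only an estimate.

```java
import java.time.Duration;
import java.util.List;

/** Hypothetical helper for filling in the "Figures" section of the template. */
final class BenchmarkFigures {

    /** Mean duration over the recorded indexing runs (the template asks for at least 3). */
    static Duration average(List<Duration> runs) {
        if (runs.size() < 3) {
            throw new IllegalArgumentException("need at least 3 runs, got " + runs.size());
        }
        long totalMillis = 0;
        for (Duration run : runs) {
            totalMillis += run.toMillis();
        }
        return Duration.ofMillis(totalMillis / runs.size());
    }

    /** Time to index 1000 documents, scaled down from the average full-run time. */
    static Duration perThousandDocs(Duration averageRun, long docCount) {
        if (docCount <= 0) {
            throw new IllegalArgumentException("docCount must be positive");
        }
        return Duration.ofMillis(averageRun.toMillis() * 1000 / docCount);
    }

    public static void main(String[] args) {
        // 1 h 12 min, 1 h 14 min and 1 h 17 min, as reported in the submission.
        List<Duration> runs = List.of(
                Duration.ofMinutes(72), Duration.ofMinutes(74), Duration.ofMinutes(77));
        Duration avg = average(runs);
        Duration per1000 = perThousandDocs(avg, 60_000); // "about 60K" documents
        System.out.println("average run: " + avg.toMinutes() + " min " + avg.getSeconds() % 60 + " s");
        System.out.println("per 1000 docs: " + per1000.toMillis() / 1000.0 + " s");
    }
}
```

With those inputs the average run is 74 minutes 20 seconds, which works out to roughly 74 seconds per 1000 documents.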
  
  
  

--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

