Message-Id: <818623B5FD23D51193200002B32C076106FE4987@excsrv44.mayo.edu>
From: "Armbrust, Daniel C."
To: 'Lucene Users List'
Subject: RE: Lucene Benchmarks and Information
Date: Mon, 23 Dec 2002 08:58:36 -0600

I ran some more query sets on the various size indexes; this time the queries contained 5 phrases, each up to 5 words long. While the queries took a lot longer to run (as expected), the time per query still grew linearly with the index size.

Dan

-----Original Message-----
From: Armbrust, Daniel C. [mailto:Armbrust.Daniel@mayo.edu]
Sent: Friday, December 20, 2002 12:55 PM
To: 'Lucene Users List'
Subject: RE: Lucene Benchmarks and Information

The queries were definitely not very intelligently built. It was a last-minute thing I decided to do for the heck of it, as my main reason for thrashing my hard drives in this exercise was to make sure I could run the document count up significantly higher than what we are currently up to at work.
The operator was chosen randomly, followed by a random field, followed by a random word. I didn't put any phrases in, as I expected the number of hits I got would be quite low (since the documents were also randomly generated), but in retrospect, even with near 0 results, it would probably be interesting. Maybe I'll run a couple tonight if I get a chance.

-----Original Message-----
From: Jonathan Reichhold [mailto:reichhold@singingfish.com]
Sent: Friday, December 20, 2002 12:24 PM
To: 'Lucene Users List'
Subject: RE: Lucene Benchmarks and Information

A question on the queries you used. What sort of distribution of terms did you use? I.e., were all the queries single random words, or did you add in multi-word queries and phrases?

I'm impressed with the results, just want to understand the testing methodology better.

JR

-----Original Message-----
From: Armbrust, Daniel C. [mailto:Armbrust.Daniel@mayo.edu]
Sent: Friday, December 20, 2002 8:57 AM
To: 'Lucene Users List'
Subject: Lucene Benchmarks and Information

I've been running some scalability tests on Lucene over the past couple of weeks. While there may be some flaws with some of my methods, I think they will be useful for people that want an idea as to how Lucene will scale. If anyone has any questions about what I did, or wants clarifications on something, I'll be happy to provide them.
I'll start by filling out the form.

Hardware Environment
* Dedicated machine for indexing: yes
* CPU: 1 x 2.53 GHz Pentium 4
* RAM: Self-explanatory
* Drive configuration: 100 GB 7200 RPM IDE, 80 GB 7200 RPM IDE

Software environment
* Java Version: java version "1.3.1"
  Java(TM) 2 Runtime Environment, Standard Edition (build 1.3.1)
  Classic VM (build 1.3.1, J2RE 1.3.1 IBM Windows 32 build cn131-20020403 (JIT enabled: jitc))
* OS Version: Win XP SP1
* Location of index: Local File Systems

Lucene indexing variables
* Number of source documents: 43,779,000
* Total filesize of source documents: ~350 GB -- never stored (documents were randomly generated)
* Average filesize of source documents: 8 KB
* Source documents storage location: Generated while indexing, never written to disk
* File type of source documents: text
* Parser(s) used, if any: None
* Analyzer(s) used: Standard Analyzer
* Number of fields per document: 2
* Type of fields: text, Unstored
* Index persistence: FSDirectory

Figures
* Time taken (in ms/s as an average of at least 3 indexing runs): See notes below
* Time taken / 1000 docs indexed: 6.5 seconds/1000, not counting optimization time. 15 seconds/1000 when optimizing every 100,000 documents and building an index of up to ~5 million documents. Above 5 million documents, optimization took too much time. See notes below.
* Memory consumption: ~200 MB
* Index Size: 70.7 GB

Notes
* Notes: The documents were randomly generated on the fly as part of the indexing process from a list of ~100,000 words, whose average length was 7. The documents had 3 words in the title and 500 words in the body.

While I was trying to build this index, the biggest limitation of Lucene that I ran into was optimization. Optimization kills the indexer's performance once an index reaches 3-5 million documents. On my Windows XP box, I had to reoptimize every 100,000 documents to keep from running out of file handles. While I could build a 5 million document index in 24 hours...
I could only add about another million over the next 24 hours due to the pain of the optimizer recopying the entire index over and over again (about 10 GB at this point), and it would only get worse from there. So, to build an index this large, I built several ~5 million document indexes, and then merged them at the end into a single index.

The second issue (though not really a problem) was that you need at least twice as much disk space available while building the index as the finished index occupies. I could have kept building the index bigger, but I ran out of disk space.

When I was done building indexes, I ran some queries against them to see how the search performance varied with the size of the index. Following are my results for various size indexes.

Index Size (GB)    MS per query
4.53               83
7.92               83
10                 89
12.7               112
52.5               694
70.7               944

These numbers are an average of 3 runs of 500 randomly generated queries being tossed at the index (single threaded) on the same hardware that built the index. The queries were randomly generated (about 50% of the queries had 0 results, 50% had 1 or more results).

I was happy to see that these numbers make a nice linear plot (attached). I'm not sure what other comments to add here, other than to thank the authors of Lucene for their great design and implementation.

If anyone has anything else they would like me to test on this index before I dump it... Speak up quick, I have to pull out one of the hard drives this weekend to pass it on to its real owner.

Dan

--
To unsubscribe, e-mail:
For additional commands, e-mail:
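
For anyone who wants to check the "linear" observation without the attached plot, here is a minimal sketch of an ordinary least-squares fit over the six (index size, ms per query) pairs from the table above. It is not part of the original test harness; the names SIZES_GB, MS_PER_QUERY and linear_fit are made up for this illustration, and only the data points come from the message.

```python
# Data copied from the results table: index size in GB, average ms per query.
SIZES_GB = [4.53, 7.92, 10.0, 12.7, 52.5, 70.7]
MS_PER_QUERY = [83.0, 83.0, 89.0, 112.0, 694.0, 944.0]


def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared). Hypothetical helper written for
    this illustration; it is not taken from the original benchmark code.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-variable linear fit, R^2 is the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


if __name__ == "__main__":
    slope, intercept, r2 = linear_fit(SIZES_GB, MS_PER_QUERY)
    print("slope (ms per GB): %.2f" % slope)
    print("intercept (ms):    %.2f" % intercept)
    print("R^2:               %.4f" % r2)
```

An R^2 close to 1 would agree with the straight-line plot described in the message. Note that the three smallest indexes sit nearly flat (83 to 89 ms), so a single line over the whole range is only an approximation at the low end.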