From: "Ryan Aslett"
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Subject: ParallelMultiSearcher vs. one big index
Date: Tue, 18 Jan 2005 11:04:08 -0800

Okay, so I'm trying to find the sweet spot for how many separate indexes I should have. I have 47 million records of contact data (name + address).
I used 7 machines to build indexes, which resulted in the following spread of individual index sizes:

  1503000
  1500000
  1497000
  5604750
  5379750
  1437000
  1458000
  1446000
  1422000
  1425000
  1425000
  1404000
  1413000
  1404000
  4893750
  4689750
  4519500
  4497750
  --------
  46919250 total records

(The faster machines built the bigger indexes.)

I also joined all of these indexes together into one large 47-million-record index and ran my query pounder against both data sets: one using a ParallelMultiSearcher over the multiple indexes, and one using a normal IndexSearcher against the large index.

What I found was that for queries with one term (first name), the large index beat the multiple indexes hands down (280 queries/second vs. 170 q/s). But for queries with multiple terms (address), the multiple indexes beat the large index (26 q/s vs. 16 q/s). By the way, I'm running these on a 2-processor box with 16 GB of RAM.

So what I'm trying to determine is whether there is some equation out there that can help me find the sweet spot for splitting my indexes. Most queries are going to be multi-term, and the big O of the single-term search clearly appears to be log n. (I verified with 470 million records: the single-term search returns at 140 qps, consistent with what I believe about search algorithms.)

The equation I'm missing is the big O for the union of the result sets that match the individual terms. I'm assuming (haven't looked at the source yet) that Lucene finds all the documents that match the first term, and all the documents that match each subsequent term, and then finds the union of all the sets. Is this correct? Anybody have any ideas on how to iron out an equation for this?

Ryan

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
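As a rough mental model of that union step (a sketch of the general inverted-index technique, not Lucene's actual source), each term has a sorted posting list of matching doc IDs, and an OR over k terms is a k-way merge of those lists. The cost is then roughly O(n1 + n2 + ... + nk), the sum of the terms' document frequencies, rather than the log n of a single dictionary lookup. A minimal self-contained illustration, with hypothetical names:

```java
import java.util.*;

// Hypothetical sketch (not Lucene's code): an OR query over k terms is a
// k-way merge of k sorted posting lists. Every posting is touched exactly
// once, so the work grows with the SUM of the lists' lengths.
public class PostingUnion {

    // Merge k sorted doc-ID arrays into the sorted union of matching doc IDs.
    public static List<Integer> union(List<int[]> postings) {
        // Heap entries: {docId, listIndex, offsetInList}, ordered by docId.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < postings.size(); i++) {
            if (postings.get(i).length > 0) {
                heap.add(new int[]{postings.get(i)[0], i, 0});
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            // Emit each doc ID once, even if several terms match it.
            if (result.isEmpty() || result.get(result.size() - 1) != top[0]) {
                result.add(top[0]);
            }
            int next = top[2] + 1;
            if (next < postings.get(top[1]).length) {
                heap.add(new int[]{postings.get(top[1])[next], top[1], next});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> postings = Arrays.asList(
                new int[]{1, 4, 9},    // docs matching term A
                new int[]{2, 4, 7},    // docs matching term B
                new int[]{4, 9, 11});  // docs matching term C
        System.out.println(union(postings)); // prints [1, 2, 4, 7, 9, 11]
    }
}
```

If that model holds, splitting the index helps multi-term queries because each machine merges shorter posting lists in parallel, while a single-term lookup pays the per-index dictionary overhead once per shard, which matches the numbers above.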