From: "Erick Erickson"
To: java-user@lucene.apache.org
Date: Tue, 27 Feb 2007 09:30:14 -0500
Subject: Lucene 2.1, using FieldSelector speeds up my app by a factor of 10+, numbers attached
Message-ID: <359a92830702270630r76fba4dcl9855ef6c06ca6637@mail.gmail.com>

I thought I'd put up some numbers that may be useful for people who find themselves doing performance tuning and/or are just curious. See the end of this e-mail for design notes.

DISCLAIMER: Your results may vary. Once I figured out the speed-up I got by using FieldSelector, I stopped looking for further improvements or refining my test harness, since we're now getting better than 3 times the design performance target. So, while I'm quite confident I'm seeing a *very* significant improvement, these numbers aren't all that precise.

I'm into the performance tuning phase now, so I wrote a little test harness that creates a configurable number of threads firing queries off at my search engine with no delays, first firing off a warm-up query before starting any of the threads. It's a fairly simple measurement, but the results are pretty consistent and way better than "one-one thousand, two-one thousand"...

This particular application returns lots of summaries at a time as a result of a search; the default is 500.
This is summary information, so I only return 6 fields from each document. I'm using a TopDocs to assemble results.

Baseline for returning 1,000 results: 0.9 or so queries per second (QPS), before any tuning. This is not acceptable in our app...

So I started by asking:
What happens if I retrieve only one doc?
What happens if I retrieve 100 docs?
What happens if I retrieve 1000 docs?

All the above require the same search effort, including sorting, so the fact that my results were as follows led me to scratch my head, since I expected the time to be spent in searching and sorting. Note that these numbers are with default (relevance) sorting. Sorting on other fields costs about 0.2 QPS, so I'll ignore it.

returning 1 doc, 33 qps
returning 100 docs, 4.34 qps
returning 1000 docs, 0.88 qps (ZERO.88. Less than 1)

Hmmmm, sez I. This is surprising. So I commented out the document fetch and kludged in hard-coded responses for the data I would have gotten from the loaded document and got 11 QPS. So then I uncommented the document fetch (without FieldSelector) but still used fake field data and was back to 0.89 QPS. Men have been hanged on flimsier evidence.

So, I poked around and found FieldSelector, which has been mentioned several times on the mailing list, but I hadn't found reason to use it yet. It took about 1/2 hour to implement and run my first test. Then I spent another hour realizing that I had foolishly excluded a couple of compressed un-indexed fields that could be loaded. If a field can be loaded the usual way, it can be loaded with a FieldSelector. Sheeeesh...

Anyway, here are the results of using FieldSelector to load only the fields I need.

returning 1,000 docs, 12.5 QPS, excluding the 2 compressed fields
(just skipping them)
returning 1,000 docs, 7.14 QPS, including loading the compressed fields

So, I regenerated the index without compressing those two fields, and the result is:

returning 1,000 docs, all necessary fields, none compressed: 9 QPS

The regenerated index has two fields (one an integer and one the title of the book) that were stored compressed and not indexed in the 7.14 QPS case, and stored and indexed UN_TOKENIZED in the 9 QPS case. No, don't ask me what I was thinking when I compressed a 4 digit field. I plead advancing senility.

And the little moral here, one I return to repeatedly: the preliminary test took me maybe 3 hours to write and get the first set of anomalous results, which pointed me in a completely different direction than I expected. There's no substitute for data when performance tuning.

Design notes:

I strongly suspect that the meta-data heavy design of this index is the main reason for the differences I'm finding when I use IndexReader.document(doc, FieldSelector) rather than IndexReader.document(doc). I doubt (but have no evidence) that an index with no meta-data would get this kind of performance improvement.

My particular application indexes 20,000+ books, some of them quite large (i.e. over 7,000 pages). The index approaches 8G. I designed it to avoid needing a database, so I store a LOT of data I don't search. Some of it is compressed and the meta-data is not indexed.

The point is that in this particular application there may be as much data stored as indexed for each book. And extracting it, particularly the compressed fields (which may be quite large), turns out to be expensive. I haven't calculated an exact ratio of stored to indexed data.

And, far and away the largest amount of meta-data (I'm guessing 90%) is irrelevant to the search results I'm concentrating on here. So avoiding the overhead of loading the unneeded meta-data is where the savings are coming from, I believe.
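As a rough cross-check on where the time goes, the QPS figures reported earlier in this message can be turned into an approximate per-document fetch cost. The sketch below is only arithmetic on those published numbers. It assumes (and this is an assumption, not something measured here) that 1/QPS is a fair stand-in for aggregate wall-clock time per query, that the 1-doc run approximates the fixed search-and-sort cost, and that cost grows linearly with the number of documents returned. The class and method names are made up for illustration.

```java
final class FetchCostEstimate {

    /** Aggregate milliseconds per query implied by a throughput figure. */
    static double msPerQuery(double qps) {
        return 1000.0 / qps;
    }

    /** Throughput ratio of a tuned run over the baseline run. */
    static double speedup(double tunedQps, double baselineQps) {
        return tunedQps / baselineQps;
    }

    /**
     * Rough per-document cost in ms. ASSUMPTION: cost is linear in the number
     * of documents returned, and the 1-doc run approximates the fixed
     * search-and-sort cost that every run pays.
     */
    static double msPerDoc(double qpsOneDoc, double qpsManyDocs, int manyDocs) {
        return (msPerQuery(qpsManyDocs) - msPerQuery(qpsOneDoc)) / (manyDocs - 1);
    }

    public static void main(String[] args) {
        double oneDoc = 33.0;     // returning 1 doc
        double baseline = 0.88;   // returning 1000 docs, no FieldSelector
        double[] tuned = {12.5, 7.14, 9.0};
        String[] label = {
            "FieldSelector, compressed fields skipped",
            "FieldSelector, compressed fields loaded",
            "FieldSelector, index rebuilt uncompressed"
        };

        System.out.printf("no FieldSelector: %.3f ms/doc%n",
                msPerDoc(oneDoc, baseline, 1000));
        for (int i = 0; i < tuned.length; i++) {
            System.out.printf("%s: %.1fx faster, %.3f ms/doc%n",
                    label[i], speedup(tuned[i], baseline),
                    msPerDoc(oneDoc, tuned[i], 1000));
        }
    }
}
```

Under those assumptions the 1000-doc runs work out to roughly 1.1 ms per document without FieldSelector and about 0.05 ms per document in the 12.5 QPS case, with speed-ups of about 14x, 8x and 10x for the three tuned runs. The 100-doc run implies closer to 2 ms per document, so the linear model is only good for an order-of-magnitude comparison.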
The underpinning of this design is that I need to search lots of page text, but only when displaying a specific book do I care about things like how many pages are in each chapter, the start and end page of each chapter, the size of the image corresponding to each page, etc. I never have to search the meta-data, so I store it but don't index it. This allows me to avoid connecting to a database, simplifying the application considerably.

Let me add a HUGE thanks for the FieldSelector (a subset of lazy loading?) and the work that went into it. It's a rare pleasure (actually, not all that rare in Lucene) to find a ready-made solution to my problem if I'm just smart enough to look for it.

Otis, Yonik, Eric et al.: Feel free to add anything from this e-mail to any documentation you wish if you think it'd be useful there.

Best
Erick
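For anyone wanting to reproduce this kind of measurement, a harness along the lines described near the top of this message (one warm-up query, then a configurable number of threads firing queries with no delay) could be sketched as follows. This is not the harness used for the numbers above. The class name, the method name and the use of a plain Runnable as the "query" are all assumptions made for illustration, and it is written against a current JDK (lambdas) rather than the Java of 2007.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

final class QpsHarness {

    /**
     * Runs one untimed warm-up query, then starts {@code threads} workers that
     * each fire {@code queriesPerThread} queries back to back. Returns the
     * number of timed queries per second of wall-clock time.
     */
    static double measureQps(Runnable query, int threads, int queriesPerThread) {
        query.run(); // warm-up, excluded from the timing

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch startGate = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(threads);

        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    startGate.await(); // hold each worker until the gate opens
                    for (int i = 0; i < queriesPerThread; i++) {
                        query.run(); // no delay between queries
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    finished.countDown();
                }
            });
        }

        long begin = System.nanoTime();
        startGate.countDown(); // release the workers and start the clock
        try {
            finished.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } finally {
            pool.shutdown();
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - begin);
        return (double) threads * queriesPerThread / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        AtomicLong calls = new AtomicLong();
        // Stand-in for a real search call; replace with the query under test.
        double qps = measureQps(calls::incrementAndGet, 4, 1_000);
        System.out.printf("%d calls (including warm-up), %.1f QPS%n", calls.get(), qps);
    }
}
```

The start gate is there so that thread creation mostly happens before the clock starts. A worker that throws still counts its remaining queries as done, so a real harness would probably want to track failures separately.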