From: Grant Ingersoll
Subject: Re: Optimizing search speed & performance for a 10G Index.
Date: Fri, 8 Dec 2006 06:44:13 -0500
To: java-user@lucene.apache.org

Have you done any profiling of your application yet to identify
bottlenecks (i.e., are you sure it is Lucene)? Without some
profiling, you really will just be guessing. Also, search this list and
the dev list for "performance", as there have been many lengthy
discussions in the past on optimizations that may give you some
ideas. Is there any way you can avoid spawning the extra
searches?

Also, how are you handling the newsdate field? Are you using a Range
Query or a Range Filter?

Do you have any fields in your documents that are large, stored
fields? Lazy loading and/or the field selector may help there.
Search this list or the dev list for info.

How are you creating your queries? Is there a lot of analysis
involved?

Of course, there always comes a time when you need to look at
distributing the load, but I am not sure you are there yet. I
seem to recall people being able to handle 10 GB without too much trouble
on a machine of that size, but I could be wrong.

-Grant

On Dec 8, 2006, at 1:10 AM, Chun Wei Ho wrote:

> Hi,
>
> We run a search engine based on Lucene 1.9.1 / Nutch 0.7.2. Our index
> has approximately 2 million documents and the physical size of it is
> about 10 GB.
> We run it as a Tomcat web application on a Fedora Core 4
> server with dual Xeon 3.2GHz processors and 4GB RAM.
>
> We receive about 46500 web search requests a day (ranging from 50-300
> requests per 5 minutes across the day). Each web search request could
> spawn about one to three actual Lucene searches. Our average response
> time (calculated on the server side, and so excluding network
> latency) is about 2 seconds.
>
> Does this timing of 2 seconds appear plausible for Lucene, based on
> the machine specifications above?
>
>
> Our index is slightly more complex (with multiple fields like title,
> location, site, content). For example, a search for "Linux and Lucene"
> related entries in "Australia" might result in Lucene searches for:
>
> ((title:linux^1.0 title:lucene^1.0)^4.0)
> +((
> +(title:linux^5.0 location:linux^1.5 content:linux^1.0)
> +(title:lucene^5.0 location:lucene^1.5 content:lucene^1.0))
> ((+(+content:linux +content:lucene)) +(site:contentsite1
> site:contentsite2 site:contentsite3 site:contentsite4
> site:contentsite5 site:contentsite6 site:contentsite7)))^0.01))
> +location:australia)
> +newsdate:[20061107 TO 20061208]
> +region:au)
> -jobsite:badsite1 -region:badregion1 -jobsite:badsite2
> -jobsite:badsite3 -jobsite:badsite4
>
> Does anyone have ideas, or could anyone point us to resources that would
> allow us to improve this performance? A 2-second response plus network
> latency gives an impression of "slowness" on our site that we are
> trying to reduce.
>
> Thank you.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
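
To make the profiling suggestion above concrete, here is a minimal sketch of per-stage timing in plain Java. It is not part of Lucene or Nutch: the class name StageTimer and the stage names ("buildQuery", "search") are hypothetical, and the lambdas in main are stand-ins for the real query-building and search calls. It also assumes a Java 8 or later runtime. The idea is only to split the 2-second server-side figure into stages so it becomes clear whether the Lucene search itself dominates.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical helper: accumulates wall-clock time per named stage of a request. */
public class StageTimer {
    // stage name -> {total elapsed nanoseconds, number of calls}
    private final Map<String, long[]> stats = new LinkedHashMap<>();

    /** Runs the task, records its elapsed time under the stage, and returns its result. */
    public <T> T time(String stage, Callable<T> task) throws Exception {
        long start = System.nanoTime();
        try {
            return task.call();
        } finally {
            // Recorded even if the task throws, so failed calls are still counted.
            record(stage, System.nanoTime() - start);
        }
    }

    /** Adds one measurement for the given stage. */
    public synchronized void record(String stage, long elapsedNanos) {
        long[] s = stats.computeIfAbsent(stage, k -> new long[2]);
        s[0] += elapsedNanos;
        s[1]++;
    }

    /** Number of measurements recorded for the stage (0 if none). */
    public synchronized long calls(String stage) {
        long[] s = stats.get(stage);
        return s == null ? 0 : s[1];
    }

    /** Mean elapsed time in milliseconds for the stage (0.0 if none recorded). */
    public synchronized double averageMillis(String stage) {
        long[] s = stats.get(stage);
        return (s == null || s[1] == 0) ? 0.0 : (s[0] / 1e6) / s[1];
    }

    public static void main(String[] args) throws Exception {
        StageTimer timer = new StageTimer();
        // Stand-ins for the real work; replace with the actual query parsing and search calls.
        String query = timer.time("buildQuery", () -> "title:linux title:lucene");
        Integer hits = timer.time("search", () -> { Thread.sleep(20); return 42; });
        System.out.println("query=" + query + ", hits=" + hits);
        System.out.println("buildQuery avg ms: " + timer.averageMillis("buildQuery"));
        System.out.println("search avg ms:     " + timer.averageMillis("search"));
    }
}
```

Wrapping each of the one to three Lucene searches per web request in its own stage, plus separate stages for query construction and result rendering, would show which part to tune first. A real profiler will give finer detail; this only gives a coarse breakdown.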