From: Chris McGee
To: java-user@lucene.apache.org
Subject: Re: How to improve performance of large numbers of successive searches?
Date: Mon, 14 Apr 2008 10:17:30 -0400
In-Reply-To: <359a92830804101318l46a4a327j2552311db6126698@mail.gmail.com>

Hi Erick,

Thanks for the information. I tried using a HitCollector and a FieldSelector. I'm getting some dramatic improvements gathering large result sets using the FieldSelector. As it turned out I was able to assume in many cases that I could break out after a specific field in each document.

Assuming that I need to gather all result documents each time, what are the advantages of using a HitCollector over Hits?

Is there some way that I can load the index portion of the lucene data storage into RAM without loading everything into a RAMDirectory?

Thanks,
Chris McGee

"Erick Erickson"
10/04/2008 04:18 PM
Please respond to java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: How to improve performance of large numbers of successive searches?

From this <<< iterate over all of the hits>>> I infer that you're using a Hits object.
This is a no-no when getting more than 100 or so objects. In a nutshell, the query gets re-executed every 100 fetches. So your 2,000 hits are executing the query 20 times. The Hits object is optimized for returning the top few scoring documents rather than getting the entire result set. See HitCollector/TopDocs/TopDocCollector etc. for better ways of doing this.

Also, if you're calling IndexReader.document(i) for each document you'll inevitably take a lot of time, as you're loading all of each document. Think about lazy field loading (see FieldSelector).

Best
Erick

P.S. If this is totally off base, perhaps you could post some of the code you think is slow....

On Thu, Apr 10, 2008 at 2:34 PM, Chris McGee wrote:
> Hello,
>
> I am building fairly large directories (200-500 MB of disk space) using
> lucene-java. Sometimes it can take upwards of 10-15 mins to create the
> documents and write them to disk using my current configuration. I have
> upgraded to the latest 2.3.1 version and followed many of the
> recommendations offered on the wiki:
>
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>
> These tips have significantly improved the time to build the directory and
> search it. However, I have noticed that when I perform term queries using
> a searcher many times in rapid succession and iterate over all of the hits
> it can take a significant time. To perform 1000 term query searches each
> with around 2000 hits it takes well over a minute. The time seems to vary
> linearly based on the number of searches (i.e. 10 times more searches take
> 10 times longer). I tried combining the searches into a BooleanQuery but
> it only shaves off a small percentage (5-10%) of the total time.
>
> I was wondering if there is a faster way to retrieve all of the results
> for my large collections of terms without using more memory and without
> taking more time to build the directory?
> I already looked at bypassing the
> searcher and using the IndexReader.termDocs() method directly to retrieve
> the documents but there did not seem to be much performance improvement.
> In the majority of my cases I am simply looking for a large number of
> values in the same field. Also, I'm not interested in scoring results
> based on frequency or weights; I need to retrieve all of the results
> anyway.
>
> Any help with this would be great.
>
> Thanks,
> Chris McGee
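
Erick's arithmetic about Hits (2,000 hits fetched in batches of 100 means roughly 20 query executions, versus a single pass with a collector) can be sketched with a toy model. This is only an illustration that takes the thread's description at face value. It does not call Lucene, and every name in it (HitsCostSketch, ToyIndex, Collector, iteratePaged) is hypothetical. The real Hits class may size its batches differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: none of these names are Lucene classes. */
class HitsCostSketch {

    /** Callback invoked once per matching document, in the spirit of a HitCollector. */
    interface Collector {
        void collect(int docId);
    }

    /** Hypothetical stand-in for an index that counts how often the "query" runs. */
    static final class ToyIndex {
        final int matchCount;
        int executions = 0;

        ToyIndex(int matchCount) {
            this.matchCount = matchCount;
        }

        /** Runs the query once and returns at most `limit` matching doc ids. */
        List<Integer> topDocs(int limit) {
            executions++;
            List<Integer> ids = new ArrayList<>();
            for (int i = 0; i < Math.min(limit, matchCount); i++) {
                ids.add(i);
            }
            return ids;
        }

        /** Runs the query once and pushes every match to the collector. */
        void search(Collector collector) {
            executions++;
            for (int i = 0; i < matchCount; i++) {
                collector.collect(i);
            }
        }
    }

    /**
     * Walks all matches through a cache that grows by batchSize (assumed > 0),
     * re-running the query each time the cache is exhausted.
     * Returns the number of documents visited.
     */
    static int iteratePaged(ToyIndex index, int batchSize) {
        List<Integer> cached = new ArrayList<>();
        int visited = 0;
        for (int i = 0; i < index.matchCount; i++) {
            if (i >= cached.size()) {
                // Cache miss: the whole query runs again with a larger limit.
                cached = index.topDocs(cached.size() + batchSize);
            }
            cached.get(i); // stands in for fetching hit number i
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        ToyIndex paged = new ToyIndex(2000);
        iteratePaged(paged, 100);

        ToyIndex collected = new ToyIndex(2000);
        List<Integer> ids = new ArrayList<>();
        collected.search(docId -> ids.add(docId));

        System.out.println("paged query executions: " + paged.executions);
        System.out.println("collector query executions: " + collected.executions);
    }
}
```

Under these assumptions the paged walk re-runs the query once per batch, while the collector variant runs it once however many documents match. That is the advantage Erick points to when every result is needed.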