Date: Fri, 19 Mar 2010 16:56:27 -0400 (EDT)
From: Daniel Shane
To: java-user@lucene.apache.org
Subject: PhraseQuery Performance Issues [Lucene 2.9.0]

I'm running a medium-size web search with an index just shy of 9GB holding 800,000 docs. We are using Lucene version 2.9.0 (we have not yet checked whether this applies to older versions as well).

Looking at my logs, I'm finding that phrase queries are especially slow. In our index we do not remove stopwords, so things like "the" and "is" are indexed on purpose.

If I try a phrase search like "The The", it takes about 10 seconds in Luke to get results back, and a bit less on later runs (7 sec). More complete phrases that match maybe only 1 document can also take >10 secs if they contain many stopwords.

Is this normal behavior, given that we do not remove stopwords?

Also, on some phrase queries (not all), the difference between the first call and any subsequent call can be very large. For example, a query could take 5 seconds the first time and less than 1 second when run again. Does Lucene, by default, cache anything when a (phrase) query is made, or is this simply file system caching at work?

If this is normal behavior, I assume the solution is either to remove stopwords from the index, or to shard it and search the shards with ParallelMultiSearcher.

What do you think?
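To make the stopword cost concrete, here is a minimal sketch of phrase matching over a toy positional index. It is not Lucene's implementation: the class PhraseCostSketch, its methods, and the positionsRead counter are hypothetical names invented for this illustration, and the whitespace tokenizer is an assumption. It only shows the general idea that a two-term phrase match has to look at the positions of both terms in every document that contains both, so a term like "the" that occurs many times in nearly every document makes the query touch a large share of the positional data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional index for illustration only; NOT how Lucene stores or scores phrases. */
class PhraseCostSketch {
    // term -> (docId -> ascending positions of that term in the document)
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new HashMap<>();

    // Total positions fetched for documents that contain both phrase terms
    // (an upper bound on the merge work, since the merge may stop early).
    long positionsRead = 0;

    /** Indexes a document using a naive lowercase whitespace tokenizer (an assumption). */
    void addDocument(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Returns ids of documents in which `first` is immediately followed by `second`. */
    List<Integer> phraseSearch(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        TreeMap<Integer, List<Integer>> a = postings.get(first.toLowerCase());
        TreeMap<Integer, List<Integer>> b = postings.get(second.toLowerCase());
        if (a == null || b == null) {
            return hits; // one of the terms never occurs
        }
        for (Map.Entry<Integer, List<Integer>> entry : a.entrySet()) {
            List<Integer> bPositions = b.get(entry.getKey());
            if (bPositions == null) {
                continue; // document lacks the second term, no positions needed
            }
            List<Integer> aPositions = entry.getValue();
            positionsRead += aPositions.size() + bPositions.size();
            if (hasAdjacent(aPositions, bPositions)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    /** Two-pointer merge: is there a position p in a with p + 1 in b? */
    private static boolean hasAdjacent(List<Integer> a, List<Integer> b) {
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int want = a.get(i) + 1;
            int got = b.get(j);
            if (got == want) {
                return true;
            }
            if (got < want) {
                j++;
            } else {
                i++;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PhraseCostSketch index = new PhraseCostSketch();
        index.addDocument(1, "the cat is on the mat");
        index.addDocument(2, "the the is a band");
        index.addDocument(3, "lucene is fast");

        System.out.println("hits for \"the the\": " + index.phraseSearch("the", "the"));
        System.out.println("positions read so far: " + index.positionsRead);
        System.out.println("hits for \"is fast\": " + index.phraseSearch("is", "fast"));
        System.out.println("positions read so far: " + index.positionsRead);
    }
}
```

In this toy model the stopword phrase reads positions from every document containing "the", while the phrase with a rare term only reads positions from the one document that contains both terms. How closely this mirrors a real 9GB index depends on details the sketch leaves out, such as on-disk layout and skip data.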

Daniel Shane