From: Andrzej Bialecki
To: java-user@lucene.apache.org
Date: Fri, 02 Dec 2005 12:55:49 +0100
Subject: Lucene performance bottlenecks
Message-ID: <43903645.6000705@getopt.org>

Hi,

I'm doing some performance profiling of a Nutch installation, working
with relatively large individual indexes (10 million docs), and I'm
puzzled by the
results. Here's the listing of the index:

  -rw-r--r-- 1 andrzej andrzej     9803100 Dec 2 05:24 _0.f0
  -rw-r--r-- 1 andrzej andrzej     9803100 Dec 2 05:24 _0.f1
  -rw-r--r-- 1 andrzej andrzej     9803100 Dec 2 05:24 _0.f2
  -rw-r--r-- 1 andrzej andrzej     9803100 Dec 2 05:24 _0.f3
  -rw-r--r-- 1 andrzej andrzej     9803100 Dec 2 05:24 _0.f4
  -rw-r--r-- 1 andrzej andrzej     9803100 Dec 2 05:24 _0.f5
  -rw-r--r-- 1 andrzej andrzej     9803100 Dec 2 05:25 _0.f6
  -rw-r--r-- 1 andrzej andrzej     9803100 Dec 2 05:25 _0.f7
  -rw-r--r-- 1 andrzej andrzej     9803100 Dec 2 05:25 _0.f8
  -rw-r--r-- 1 andrzej andrzej  2494445020 Dec 2 04:58 _0.fdt
  -rw-r--r-- 1 andrzej andrzej    78424800 Dec 2 04:58 _0.fdx
  -rw-r--r-- 1 andrzej andrzej          92 Dec 2 04:55 _0.fnm
  -rw-r--r-- 1 andrzej andrzej  7436259508 Dec 2 05:24 _0.frq
  -rw-r--r-- 1 andrzej andrzej 12885589796 Dec 2 05:24 _0.prx
  -rw-r--r-- 1 andrzej andrzej     3483642 Dec 2 05:24 _0.tii
  -rw-r--r-- 1 andrzej andrzej   280376933 Dec 2 05:24 _0.tis
  -rw-r--r-- 1 andrzej andrzej           4 Dec 2 05:25 deletable
  -rw-r--r-- 1 andrzej andrzej          27 Dec 2 05:25 segments

I run it on an AMD Opteron 246, 2 GHz, 4 GB RAM; java -version says:

  Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_05-b05, mixed mode)

I run it with a heap of 1.5-2.5 GB, which doesn't make any difference
(see below). I'm using the latest SVN code (from yesterday) plus the
performance enhancements to ConjunctionScorer and BooleanScorer2 from
JIRA. The performance is less than impressive, response times being
more than 1 sec.
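
For scale, the sizes in the listing above can be added up with a few
lines of Java. This is only a sketch: the byte counts are copied from
the listing, while the class name and the two per-document
interpretations (one norm byte per document in each _0.fN file, one
8-byte pointer per document in _0.fdx) are assumptions about the index
format, not something stated in the message.

```java
// Sketch only: sizes are copied from the directory listing; the
// per-document interpretations below are labeled assumptions.
class IndexSizeSketch {
    static final long NORMS_FILE_BYTES = 9_803_100L; // each of _0.f0 .. _0.f8
    static final int NORMS_FILE_COUNT = 9;
    static final long FDX_BYTES = 78_424_800L;       // _0.fdx

    static final long[] OTHER_FILE_BYTES = {
        2_494_445_020L,   // _0.fdt
        FDX_BYTES,        // _0.fdx
        92L,              // _0.fnm
        7_436_259_508L,   // _0.frq
        12_885_589_796L,  // _0.prx
        3_483_642L,       // _0.tii
        280_376_933L,     // _0.tis
        4L,               // deletable
        27L               // segments
    };

    /** Sum of all file sizes in the listing, in bytes. */
    static long totalBytes() {
        long total = NORMS_FILE_BYTES * NORMS_FILE_COUNT;
        for (long size : OTHER_FILE_BYTES) {
            total += size;
        }
        return total;
    }

    /** Assumption: a norms file holds one byte per document. */
    static long docsFromNorms() {
        return NORMS_FILE_BYTES;
    }

    /** Assumption: _0.fdx holds one 8-byte pointer per document. */
    static long docsFromFdx() {
        return FDX_BYTES / 8;
    }

    public static void main(String[] args) {
        long total = totalBytes();
        double gib = total / (1024.0 * 1024.0 * 1024.0);
        System.out.printf("total index size: %d bytes (%.1f GiB)%n", total, gib);
        System.out.printf("docs (from norms): %d%n", docsFromNorms());
        System.out.printf("docs (from .fdx):  %d%n", docsFromFdx());
    }
}
```

Under those assumptions both estimates give 9,803,100 documents, in line
with the "10 mln docs" figure, and the total is several times larger
than the 4 GB of RAM on the machine.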
Nutch produces complex queries for phrases, so the user query
"term1 term2" gets rewritten like this:

  +(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 host:term1^2.0)
  +(url:term2^4.0 anchor:term2^2.0 content:term2 title:term2^1.5 host:term2^2.0)
  url:"term1 term2"~2147483647^4.0
  anchor:"term1 term2"~4^2.0
  content:"term1 term2"~2147483647
  title:"term1 term2"~2147483647^1.5
  host:"term1 term2"~2147483647^2.0

For a simple TermQuery, if the DF(term) is above 10%, the response time
from IndexSearcher.search() is around 400 ms (repeatable, after
warm-up). For such complex phrase queries the response time is around
1 sec or more (again, after warm-up).

Initially I thought the process was I/O or heap/GC bound, this is a
large index after all, but the profiler shows it's purely CPU bound. I
tracked the bottleneck to the scorers (see my previous email on this),
but also to IndexInput.readVInt.

What's even more curious, most of the heap is unused. I had the
impression that Lucene tries to read as much of the index as it can
into memory in order to speed up access, but apparently that's not the
case. The heap consumption was always on the order of 100-200 MB, no
matter how large a heap I set (and I tried values between 1 and 4 GB).

For those interested in profiler info, look here:

  http://www.getopt.org/lucene/20051202/

Here's an example of elapsed times [ms] for IndexSearcher.search, and
for getting the first 100 docs using Hits.doc(i):

  19. Complex search1: search: 1309 hits.doc: 4
  19. Complex search2: search: 2492 hits.doc: 5
  19. Simple search:   search: 392  hits.doc: 5
  20. Complex search1: search: 1307 hits.doc: 5
  20. Complex search2: search: 2499 hits.doc: 5
  20. Simple search:   search: 391  hits.doc: 5

I would appreciate any suggestions on how to proceed with this...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
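
For readers unfamiliar with the IndexInput.readVInt hotspot mentioned in
the message: Lucene's file-format documentation describes a VInt as a
variable-length integer in which the low seven bits of each byte carry
data, least significant group first, and the high bit says whether
another byte follows. The sketch below decodes that encoding from a
plain byte array. It is an illustration, not Lucene's source; the class
name, the array-based reader, and the encoder helper are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Illustration of the VInt encoding only; not Lucene's IndexInput.
class VIntSketch {

    /**
     * Decodes one VInt starting at buf[pos[0]] and advances pos[0]
     * past the bytes it consumed.
     */
    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;                      // low 7 bits of first byte
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];                     // high bit set: read another byte
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Hypothetical helper: encodes a value in the same format. */
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);      // 7 data bits plus continuation flag
            value >>>= 7;
        }
        out.write(value);                          // last byte, high bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] samples = {1, 127, 128, 300, 16384, Integer.MAX_VALUE};
        for (int sample : samples) {
            byte[] encoded = writeVInt(sample);
            int decoded = readVInt(encoded, new int[] {0});
            System.out.println(sample + " -> " + encoded.length
                    + " byte(s) -> " + decoded);
        }
    }
}
```

Each decoded value costs at least one byte read and a data-dependent
branch. If the postings files are read this way, a term with DF above
10% of roughly 10 million documents implies on the order of a million
or more such decodes per query, which would fit a CPU-bound profile with
little heap in use.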