Message-ID: <43903645.6000705@getopt.org>
Date: Fri, 02 Dec 2005 12:55:49 +0100
From: Andrzej Bialecki <ab@getopt.org>
To: java-user@lucene.apache.org
Subject: Lucene performance bottlenecks

Hi,

I'm doing some performance profiling of a Nutch installation, working 
with relatively large individual indexes (10 million docs), and I'm 
puzzled by the results.

Here's the listing of the index:
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f0
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f1
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f2
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f3
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f4
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:24 _0.f5
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f6
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f7
-rw-r--r--  1 andrzej andrzej     9803100 Dec  2 05:25 _0.f8
-rw-r--r--  1 andrzej andrzej  2494445020 Dec  2 04:58 _0.fdt
-rw-r--r--  1 andrzej andrzej    78424800 Dec  2 04:58 _0.fdx
-rw-r--r--  1 andrzej andrzej          92 Dec  2 04:55 _0.fnm
-rw-r--r--  1 andrzej andrzej  7436259508 Dec  2 05:24 _0.frq
-rw-r--r--  1 andrzej andrzej 12885589796 Dec  2 05:24 _0.prx
-rw-r--r--  1 andrzej andrzej     3483642 Dec  2 05:24 _0.tii
-rw-r--r--  1 andrzej andrzej   280376933 Dec  2 05:24 _0.tis
-rw-r--r--  1 andrzej andrzej           4 Dec  2 05:25 deletable
-rw-r--r--  1 andrzej andrzej          27 Dec  2 05:25 segments


I run it on an AMD Opteron 246, 2 GHz, 4 GB RAM; java -version says:

Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_05-b05, mixed mode)

I run it with a heap of 1.5-2.5 GB, which doesn't make any difference 
(see below). I'm using the latest SVN code (from yesterday) + 
performance enhancements to ConjunctionScorer and BooleanScorer2 from JIRA.

The performance is less than impressive, response times being more than 
1 sec. Nutch produces complex queries for phrases, so the user query 
"term1 term2" gets rewritten like this:

+(url:term1^4.0 anchor:term1^2.0 content:term1 title:term1^1.5 
host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 content:term2 
title:term2^1.5 host:term2^2.0) url:"term1 term2"~2147483647^4.0 
anchor:"term1 term2"~4^2.0 content:"term1 term2"~2147483647 title:"term1 
term2"~2147483647^1.5 host:"term1 term2"~2147483647^2.0

For a simple TermQuery, if the DF(term) is above 10%, the response time 
from IndexSearcher.search() is around 400ms (repeatable, after warm-up). 
For such complex phrase queries the response time is around 1 sec or 
more (again, after warm-up).

Initially I thought the process was I/O or heap/GC bound (this is a 
large index, after all), but the profiler shows it's purely CPU bound. 
I tracked the bottleneck to the scorers (see my previous email on 
this), but also to IndexInput.readVInt. What's even more curious, most 
of the heap is unused. I had the impression that Lucene tries to read 
as much of the index as it can into memory in order to speed up access, 
but apparently that's not the case. The heap consumption was always on 
the order of 100-200 MB, no matter how large a heap I set (and I tried 
values between 1 and 4 GB).
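
For context on why readVInt can dominate a CPU profile: the .frq and 
.prx files store doc deltas, frequencies and position deltas as 
variable-length integers, 7 data bits per byte with the high bit 
marking "more bytes follow". Every value therefore costs a few byte 
reads, masks, shifts and branches. The following is a minimal 
standalone sketch of that encoding; the class and method names are my 
own, it works on plain java.io streams, and it is not Lucene's source.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class VIntSketch {
    // Encode low-order 7 bits first; set the high bit while more bytes follow.
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decode one value: one read, mask, shift and branch per encoded byte.
    static int readVInt(InputStream in) throws IOException {
        int b = next(in);
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = next(in);
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    private static int next(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) throw new EOFException("truncated VInt");
        return b;
    }

    public static void main(String[] args) throws IOException {
        byte[] encoded = writeVInt(300);
        int decoded = readVInt(new ByteArrayInputStream(encoded));
        System.out.println(encoded.length + " bytes, decoded " + decoded);
    }
}
```

With a term that occurs in more than 10% of 10 million docs, a loop 
like this runs at least a million times per query for the doc deltas 
alone, which would be consistent with a CPU-bound profile and a mostly 
idle heap.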

For those interested in profiler info, look here:

http://www.getopt.org/lucene/20051202/

Here's an example of elapsed times [ms] for IndexSearcher.search, and 
for getting the first 100 docs using Hits.doc(i):

19. Complex search1:
 search: 1309
 hits.doc: 4
19. Complex search2:
 search: 2492
 hits.doc: 5
19. Simple search:
 search: 392
 hits.doc: 5
20. Complex search1:
 search: 1307
 hits.doc: 5
20. Complex search2:
 search: 2499
 hits.doc: 5
20. Simple search:
 search: 391
 hits.doc: 5
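
A warm-up-then-measure loop of this kind can be sketched as below. This 
is a hypothetical harness, not the code that produced the numbers 
above: the class and method names are made up, and the Callable stands 
in for a call such as IndexSearcher.search() or a loop over 
Hits.doc(i).

```java
import java.util.concurrent.Callable;

final class TimingSketch {
    // Runs the task warmupRuns times untimed, then times one more run.
    // Returns the elapsed wall-clock time of the timed run in milliseconds.
    static <T> long timeMillis(Callable<T> task, int warmupRuns) throws Exception {
        for (int i = 0; i < warmupRuns; i++) {
            task.call();
        }
        long start = System.nanoTime();
        task.call();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload; replace with the search call being measured.
        long ms = timeMillis(() -> {
            Thread.sleep(50);
            return null;
        }, 3);
        System.out.println("search: " + ms);
    }
}
```

Timing the search and the document retrieval separately, as in the 
listing, is what shows that nearly all of the time is spent in scoring 
rather than in loading stored fields.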


I would appreciate any suggestions on how to proceed with this...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

