nutch-user mailing list archives

From "VishalS" <>
Subject RE: Search performance for large indexes (>100M docs)
Date Mon, 12 Jan 2009 11:49:58 GMT


   Thanks for the responses - I have received replies from Otis, Dennis,
Sean and Jay Pound (sorry if I forgot someone). To summarize what I
understood from these replies:


1.	The indices *have* to be in fast storage - it's difficult to get
great performance without this.
2.	It's worth looking into SSDs for storing the indices. They would
probably speed up search, are cheaper than RAM, and give almost comparable
performance.
3.	Jay mentioned that with Nutch 0.7, the hard drives were a bottleneck
for him. He got around the issue by using multiple (~15) small hard drives on
a single machine and running 10 search servers on it - that architecture gave
him reasonable performance on 2 machines.


Currently, I am unable to experiment with SSDs since my searchers are hosted
on EC2.


From my experience so far, I am also leaning towards believing that the
query plugins play a very important role in performance (apart from
relevance).


I will share my observations as I keep going.


Sean - good luck with the experiments you are conducting - way to go!






From: VishalS [] 
Sent: Tuesday, January 06, 2009 7:12 PM
To: ''
Subject: Search performance for large indexes (>100M docs)




  I am experimenting with a system with around 120 million documents. The
index is split into sub-indices of ~10M documents, each searched by a single
machine, and the results are aggregated using the DistributedSearcher client.
I am seeing a lot of performance issues with the system - most of the time
the response times are >4 seconds, and in some cases they go up to a minute.
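
For reference, the fan-out and merge that the DistributedSearcher client is
doing can be approximated at the plain-Lucene level roughly as below. This is
only a sketch against the Lucene 2.x-era API - Nutch actually does the
distribution over its own RPC - and the shard paths, field name and query
string are all made up:

// Sketch only: approximates the shard fan-out/merge that the
// DistributedSearcher client performs over RPC. Paths, field name
// and query string are hypothetical.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class FanOutSearch {
  public static void main(String[] args) throws Exception {
    // One searcher per ~10M-doc sub-index (hypothetical paths).
    Searchable[] shards = {
        new IndexSearcher("/indexes/part-00000"),
        new IndexSearcher("/indexes/part-00001"),
        new IndexSearcher("/indexes/part-00002")
    };
    // ParallelMultiSearcher queries every shard in its own thread and
    // merges the per-shard results into a single ranked hit list.
    ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);
    Query q = new QueryParser("content", new StandardAnalyzer())
        .parse("nutch search performance");
    Hits hits = searcher.search(q);
    System.out.println("total hits: " + hits.length());
    searcher.close();
  }
}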


  It would be wonderful to know if there are ways to optimize what I am
doing, or if there is something obvious that I am doing wrong. Here's what I
have tried so far, and the issues I see:


1.	Each search server is a 64-bit Pentium machine with ~7GB RAM and 4
CPUs running Linux. However, the searcher is not able to use more than 1 GB
of RAM even though I have set -Xmx to ~3.5GB. I am guessing this is a Lucene
issue. Is there a way we can have the searcher use more RAM to speed things
up? (A quick heap check is sketched after this list.)
2.	The total size of the index directory on each machine is ~70-100 GB.
The prx file is 50GB, the fnm and frq files are ~27GB each and the fdt file
is around 3GB. Is this too big?
3.	I have tried analyzing my documents for commonly occurring terms in
various fields, and added these terms to common-terms.utf8. There are ~10K
terms in this file for me now. I am hoping this will help speed up any
phrase queries I am doing internally (there is a cost in terms of extra
unique terms in the Lucene index; the total index size has increased by
~10-15%, which I guess is OK). A toy illustration of the common-grams idea
follows this list.
4.	There are around 8 fields that are searched for each of the words
in the query, and a phrase query containing all the words is also fired in
each of these fields. This means that for a 3-word input query, my Lucene
query contains 24 (3*8) term sub-queries and 8 (1*8) 3-word phrase
sub-queries. Is this too long or too expensive? (The resulting query shape
is sketched after this list.)
5.	I have noticed that the slowest-running queries (sometimes taking up
to a minute) are often the ones that contain one or more common terms.
6.	Each individual searcher has a single Lucene indexlet. Would it be
faster to have more than 1 indexlet on the machine?
7.	I am using an out-of-the-box Tomcat 6.0 installation, with some minor
changes to the number of threads and the Java stack size allocation.
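
On item 1: a quick way to confirm the -Xmx setting actually reached the
searcher JVM is to log the runtime limits from inside it. Note that much of
Lucene's working set lives in the OS page cache rather than the Java heap,
so low heap usage by itself may not be the problem:

public class HeapCheck {
  public static void main(String[] args) {
    // Prints the heap limits the JVM actually started with; if max heap
    // reads ~1GB despite -Xmx3500m, the flag is not reaching this JVM.
    Runtime rt = Runtime.getRuntime();
    System.out.println("max heap   (MB): " + rt.maxMemory() / (1024 * 1024));
    System.out.println("total heap (MB): " + rt.totalMemory() / (1024 * 1024));
    System.out.println("free heap  (MB): " + rt.freeMemory() / (1024 * 1024));
  }
}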
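
On item 3: the toy illustration of the common-grams idea mentioned above.
This is not Nutch's actual CommonGrams code - the joining convention here is
made up - but it shows the principle: a term listed in common-terms.utf8
gets glued to its neighbour, so a phrase containing it can be matched with
ordinary term lookups instead of walking the (huge) prx positions:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CommonGramsSketch {
  // Glue each common term to the token that follows it; other tokens
  // pass through unchanged.
  public static List<String> gram(String[] tokens, Set<String> common) {
    List<String> out = new ArrayList<String>();
    for (int i = 0; i < tokens.length; i++) {
      if (common.contains(tokens[i]) && i + 1 < tokens.length) {
        out.add(tokens[i] + "-" + tokens[i + 1]);  // bigram token
      } else {
        out.add(tokens[i]);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Set<String> common = new HashSet<String>(Arrays.asList("the", "of"));
    String[] tokens = { "the", "history", "of", "search" };
    // Prints [the-history, history, of-search, search].
    System.out.println(gram(tokens, common));
  }
}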
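
On item 4: the query shape described above, sketched directly against the
Lucene API - for an N-word query over F fields, N*F term clauses plus F
phrase clauses. The field names a caller would pass in are hypothetical:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class ExpandedQuery {
  public static BooleanQuery expand(String[] words, String[] fields) {
    BooleanQuery bq = new BooleanQuery();
    for (String field : fields) {
      // One term clause per (word, field) pair: N*F clauses in total.
      for (String word : words) {
        bq.add(new TermQuery(new Term(field, word)), BooleanClause.Occur.SHOULD);
      }
      // One phrase clause per field: F clauses, each of which has to
      // read positions from the prx file - the expensive part.
      PhraseQuery pq = new PhraseQuery();
      for (String word : words) {
        pq.add(new Term(field, word));
      }
      bq.add(pq, BooleanClause.Occur.SHOULD);
    }
    return bq;  // 3 words x 8 fields => 24 term + 8 phrase sub-queries
  }
}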


If there's anyone else who has had experience working with large indices, I
would love to get in touch and exchange notes.




