hadoop-mapreduce-user mailing list archives

From Christoph Schmitz <Christoph.Schm...@1und1.de>
Subject Re: Performance improvement-Cluster vs Pseudo
Date Fri, 30 Mar 2012 08:46:20 GMT
Hi Ashish,

IMHO your numbers (2 machines, 10 URLs) are way too small to outweigh the natural overhead
that occurs with a distributed computation (distributing the program code, coordinating the
distributed file system, making sure everybody is starting and stopping, etc.). Also, if you're
web crawling, the bottleneck might not even be the processing capacity of your machines, but
rather some network component on the way between you and the web.
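
To make the overhead argument concrete, here is a rough sketch in Python. The model
(a fixed startup/coordination cost plus work that divides evenly across nodes) and all
the numbers in it are assumptions for illustration only, not measurements from Hadoop
or Nutch, and it ignores the network bottleneck mentioned above.

```python
def job_time(work_seconds, nodes, overhead_seconds):
    """Estimated wall-clock time: fixed overhead plus evenly divided work."""
    return overhead_seconds + work_seconds / nodes


def speedup(work_seconds, nodes, overhead_seconds):
    """Speedup relative to one local process that pays no cluster overhead."""
    return work_seconds / job_time(work_seconds, nodes, overhead_seconds)


# Hypothetical figures: 30 s of fixed cluster overhead per job.
OVERHEAD = 30.0

# Tiny job (10 s of actual work): the overhead dominates, so the
# 2-node cluster is slower than running locally (speedup below 1).
small = speedup(10.0, nodes=2, overhead_seconds=OVERHEAD)

# Large job (10 hours of actual work): the overhead is negligible,
# so the speedup approaches the number of nodes.
large = speedup(36000.0, nodes=2, overhead_seconds=OVERHEAD)

print(f"small job speedup: {small:.2f}")
print(f"large job speedup: {large:.2f}")
```

With these made-up numbers the small job runs slower on the cluster than locally, while
the large job gets close to a 2x speedup on two nodes.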

I'm not aware of any Hadoop or Nutch benchmarks, but once you use larger data and/or CPU-intensive
computations, you should actually see a more or less linear increase in throughput with more


-----Original Message-----
From: ashish vyas [mailto:mailashishvyas@gmail.com]
Sent: Friday, 30 March 2012 10:30
To: mapreduce-user@hadoop.apache.org
Subject: Performance improvement-Cluster vs Pseudo



	I have set up a Hadoop cluster (2 nodes) and I am running a Nutch crawl on it. I am trying
to compare results and the improvement in processing time when I crawl with 10 URLs and depth
2. When I run the crawl on the cluster, it takes more time than on the pseudo cluster, which in
turn takes more time than a standalone Nutch crawl.
	I would have expected processing time to come down when running Nutch on a Hadoop cluster,
since that is why Hadoop evolved out of the Nutch project. Please let me know
if there is any benchmark test for pseudo vs. cluster, and why the Nutch crawl is taking more time
on the cluster.


	Please let me know if you need more info.



	Ashish Vyas
