hadoop-mapreduce-user mailing list archives

From ashish vyas <mailashishv...@gmail.com>
Subject Re: Performance improvement-Cluster vs Pseudo
Date Fri, 30 Mar 2012 09:25:35 GMT
@Christoph: Thanks for replying. I will try with more nodes and a larger URL
set to see how much improvement in processing time I get from the cluster.

@mapreduce-mailing-community: It would be great if anybody could help me with
a Nutch benchmark on a small cluster, since it would help me determine the
number of machines I would need for my application to scale up.

Ashish Vyas
On Fri, Mar 30, 2012 at 2:16 PM, Christoph Schmitz <
Christoph.Schmitz@1und1.de> wrote:

> Hi Ashish,
> IMHO your numbers (2 machines, 10 URLs) are way too small to outweigh the
> natural overhead that occurs with a distributed computation (distributing
> the program code, coordinating the distributed file system, making sure
> everybody is starting and stopping, etc.). Also, if you're web crawling,
> the bottleneck might not even be the processing capacity of your machines,
> but rather some network component on the way between you and the web.
> I'm not aware of any Hadoop or Nutch benchmarks, but once you use larger
> data and/or CPU-intensive computations, you should actually see a more or
> less linear increase in throughput with more machines.
> Regards,
> Christoph
> -----Original Message-----
> From: ashish vyas [mailto:mailashishvyas@gmail.com]
> Sent: Friday, 30 March 2012 10:30
> To: mapreduce-user@hadoop.apache.org
> Subject: Performance improvement-Cluster vs Pseudo
>        Hi,
>        I have set up a Hadoop cluster (2 nodes) and I am running a Nutch
> crawl on it. I am trying to compare the results and the improvement in
> processing time when I crawl with 10 URLs and depth 2. When I run the crawl
> on the cluster, it takes more time than on the pseudo-distributed cluster,
> which in turn takes more time than a standalone Nutch crawl.
>        I expected that running Nutch on a Hadoop cluster would bring the
> processing time down, since that is why Hadoop evolved out of the Nutch
> project. Please let me know if there is any benchmark test for pseudo vs
> cluster, and why the Nutch crawl is taking more time on the cluster.
>        Please let me know if you need more info.
>        Regards,
>        Ashish Vyas
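
Christoph's point about fixed overhead can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not a Hadoop or Nutch benchmark: the function names, the 30-second per-job overhead, and the 2 seconds of work per URL are all hypothetical values chosen for illustration, and the model ignores network bottlenecks entirely.

```python
def estimated_runtime(num_urls, nodes, seconds_per_url=2.0, overhead_per_job=30.0):
    """Toy model: distributed runs pay a fixed overhead, then split the work evenly.

    All default values are hypothetical, not measured on any real cluster.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # Assume a standalone run (one node) pays no distribution overhead.
    overhead = overhead_per_job if nodes > 1 else 0.0
    return overhead + (num_urls * seconds_per_url) / nodes


def break_even_urls(nodes, seconds_per_url=2.0, overhead_per_job=30.0):
    """Workload size above which the cluster beats a standalone run in this model.

    Derived from N*s > o + N*s/n, i.e. N > o / (s * (1 - 1/n)).
    """
    if nodes < 2:
        raise ValueError("break-even is only defined for 2 or more nodes")
    return overhead_per_job / (seconds_per_url * (1 - 1 / nodes))


if __name__ == "__main__":
    for urls in (10, 100, 10_000):
        standalone = estimated_runtime(urls, nodes=1)
        cluster = estimated_runtime(urls, nodes=2)
        print(f"{urls:>6} URLs: standalone {standalone:.0f}s, 2-node cluster {cluster:.0f}s")
    print("break-even (2 nodes):", break_even_urls(2), "URLs")
```

Under these assumed numbers, 10 URLs on 2 nodes is slower than standalone because the overhead dominates, while at 10,000 URLs the 2-node run approaches half the standalone time, which matches the roughly linear scaling described above.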
