hadoop-common-user mailing list archives

From Kevin Peterson <kpeter...@biz360.com>
Subject Re: How many nodes does one man want?
Date Fri, 27 Mar 2009 22:43:36 GMT
On Thu, Mar 26, 2009 at 4:38 PM, Sid123 <itissid@gmail.com> wrote:

> I am working on implementing some machine learning algorithms using
> MapReduce. I want to know: if I have data that takes 5-6 hours to train on
> a normal machine, will adding 2-3 more nodes have an effect? I read in the
> Yahoo Hadoop tutorial:
> "Executing Hadoop on a limited amount of data on a small number of nodes may
> not demonstrate particularly stellar performance as the overhead involved in
> starting Hadoop programs is relatively high. Other parallel/distributed
> programming paradigms such as MPI (Message Passing Interface) may perform
> much better on two, four, or perhaps a dozen machines."
> I have at my disposal 3 laptops, each with 4 GB of RAM and 150 GB of hard
> disk space. I have 600M of training data.

I'd say don't bother. Not because adding two machines won't speed things up
(it may come close to doubling your performance), but because of the hassle
of setting up Hadoop, copying data in and out of HDFS, restructuring your
code to fit the map-reduce paradigm, and so on.
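
A rough back-of-envelope sketch of that trade-off, in Python. Everything here
is an assumption for illustration: the function name, the fixed overhead
values (cluster setup, HDFS copies, job startup), and the 0.9 parallel
efficiency are made up, not measured. Only the 5-6 hour baseline and the
3 machines come from the question.

```python
def estimated_hours(single_machine_hours, nodes, fixed_overhead_hours, efficiency=1.0):
    """Estimate wall-clock hours for a job spread over `nodes` machines.

    Hypothetical model: the work divides evenly across nodes (scaled by
    `efficiency`), plus a fixed overhead that does not shrink with more nodes.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    parallel_hours = single_machine_hours / (nodes * efficiency)
    return fixed_overhead_hours + parallel_hours


if __name__ == "__main__":
    baseline = 5.5  # midpoint of the 5-6 hours mentioned in the question
    for overhead in (0.25, 1.0, 2.0):  # assumed overheads, in hours
        hours = estimated_hours(baseline, nodes=3,
                                fixed_overhead_hours=overhead, efficiency=0.9)
        print(f"overhead {overhead:.2f} h: about {hours:.2f} h "
              f"(speedup {baseline / hours:.2f}x)")
```

The model ignores the one-time cost of porting the code to map-reduce, which
is the larger part of the hassle described above.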

I have a machine learning task that takes about an hour on my machine. I
find running it locally significantly more convenient than running it on
Hadoop, even though I'm already working within Hadoop. Of course, some of
this inconvenience is due to EC2, not Hadoop itself. If I could run from
inside Eclipse, it might be a different story.
