hadoop-common-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: program running faster on single node than cluster
Date Thu, 18 Nov 2010 16:26:55 GMT


There could be a couple of reasons for this...
If you have a small data set, say one that fits into a single block, and you have 12 nodes
each with 10 mappers, that's 120 potential mappers going against the same block, right?
Assuming you're splitting the file into 120 pieces, each mapper does a trivial amount of
work, and the task startup and scheduling overhead ends up dominating the run time.
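If that's the case, one fix is to keep the whole file in fewer splits so fewer mappers run.
A minimal sketch, assuming the old 0.20-era property name (mapred.min.split.size; newer
releases spell it mapreduce.input.fileinputformat.split.minsize) and a 64 MB block size:

  <!-- in the job conf or mapred-site.xml (sketch):
       never produce splits smaller than 64 MB, so a file that
       fits in one block yields exactly one map task -->
  <property>
    <name>mapred.min.split.size</name>
    <value>67108864</value>
  </property>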

Tuning: How big are your nodes? How many cores and how much memory?

While some tune to the number of virtual cores, I tend to be conservative and tune to the
actual number of physical cores.

The reason is that you have to consider how much memory you have on the box. Since we run
HBase alongside the TaskTracker (TT) and DataNode (DN), memory gets used up pretty quickly.
If your nodes are swapping... that could hurt too.

Just an example... suppose you have 8 GB on an 8-core box. You set up the TT and DN each
with 1 GB, so you have 6 GB left. Assuming 1 GB per mapper/reducer, you're looking at 4
mappers / 2 reducers per node. Again, YMMV (your mileage may vary); this is just a rough
example. It's always safer to start with a lower number and monitor your system via Ganglia
to see how to tune (because you really, really, ... really don't want to swap).
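To make that concrete, here's a sketch of the mapred-site.xml entries I'd start with on each
worker, assuming the 0.20-era property names and the 8 GB / 8-core box above (2 GB for TT +
DN leaves ~6 GB, or ~1 GB per task slot):

  <!-- mapred-site.xml (sketch): 4 map slots + 2 reduce slots per node,
       each child task JVM capped at 1 GB so the box never swaps -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>

One gotcha: the *.tasks.maximum values are read by each TaskTracker at startup, so you have
to restart the TTs before a change takes effect.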



> Date: Wed, 17 Nov 2010 14:53:12 +0530
> Subject: Re: program running faster on single node than cluster
> From: hsreekumar@clickable.com
> To: common-user@hadoop.apache.org
> Are all the nodes being used? Go to <master>:50030 in the web UI
> after starting the job, and check whether tasks are progressing
> on all the nodes or only on some.
> hari
> On Wed, Nov 17, 2010 at 9:14 AM, Cornelio Iñigo
> <cornelio.inigof@gmail.com> wrote:
> > Hi
> >
> > I have a question for you:
> >
> > I developed a program using Hadoop. It has one map function and one reduce
> > function (like WordCount), and the map function does all the processing of
> > my data.
> > When I run this program on a single-node machine it takes about 7 minutes
> > (it's a small dataset), and on a pseudo-distributed setup it takes about 7
> > minutes too, but when I run it on a
> > fully distributed cluster (12 nodes) it takes much longer, like an hour!!
> >
> > I tried changing the mapred.tasktracker.map.tasks.maximum and
> > mapred.tasktracker.reduce.tasks.maximum variables (2 and 2, the default,
> > then 10 and 2, 2 and 10, 5 and 5) and the results are the same.
> > Am I missing something?
> > Is this a cluster configuration issue or is in my program?
> >
> > Thanks
> >
> > --
> > *Cornelio*
> >