hadoop-common-user mailing list archives

From Alex Baranau <alex.barano...@gmail.com>
Subject Re: program running faster on single node than cluster
Date Thu, 18 Nov 2010 06:34:14 GMT
These config settings depend on the nature of your MR job and the resources
available on each node. Since increasing the heap size affected the run time
dramatically, I assume your jobs "like" memory. Can you describe your
machines? Also, make sure you don't have any network issues (a slow network
can cause slowness when switching from standalone to distributed mode).
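
Note that HADOOP_HEAPSIZE in conf/hadoop-env.sh only sizes the Hadoop
daemons themselves (namenode, tasktracker, datanode, etc.); the JVMs spawned
for your map/reduce tasks are sized via mapred.child.java.opts. Roughly (the
2000 below is just an example value, not a recommendation):

    # conf/hadoop-env.sh
    # heap size in MB for the Hadoop daemons, not for the task JVMs
    export HADOOP_HEAPSIZE=2000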

Try increasing the mapred.child.java.opts parameter if your jobs are
memory-demanding.
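
For example, something like this in conf/mapred-site.xml on the tasktracker
nodes (the -Xmx value below is just a placeholder; pick it based on the RAM
available per task slot):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx2000m</value>
    </property>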

A comment re mapred.tasktracker.reduce.tasks.maximum and
mapred.tasktracker.map.tasks.maximum: these are not the maximum number of
map/reduce slots for the whole cluster, but the maximum per node (i.e. per
tasktracker). So if you decide, for example, that it's reasonable to handle
N map tasks on each node, you should put N into
mapred.tasktracker.map.tasks.maximum, not N * node count.
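
E.g. on your 12-node cluster, setting the map maximum to 4 per node gives
12 * 4 = 48 map slots cluster-wide. Something like this would go into
conf/mapred-site.xml on each tasktracker node (the 4 is only an
illustration; match it to the cores and RAM of your machines):

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>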

Also, take a look at the web interface (as Hari suggested) to see how many
map and reduce tasks were started for your job and how many nodes were used
to process it.
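
The jobtracker web UI is usually at http://<jobtracker-host>:50030/ and
shows, per job, how many map and reduce tasks ran and on which tasktrackers.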

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Thu, Nov 18, 2010 at 8:19 AM, Cornelio Iñigo
<cornelio.inigof@gmail.com> wrote:

> Hi,
> The cluster has 12 nodes plus the master node. I made a new test,
> increasing the child nodes' memory to 2000m and HADOOP_HEAP_SIZE to 2000m,
> with mapred.tasktracker.map.tasks.maximum and
> mapred.tasktracker.reduce.tasks.maximum at 2 (the default), and now the
> time is 6 minutes, but I think that is still a lot compared to the
> single-node run (7 to 8 minutes).
>
> It seems to be a configuration issue but I'm not sure what values I have
> to put (for the 12-node cluster).
> The bibliography says that:
>   mapred.tasktracker.map.tasks.maximum: between 10 and 100 maps/node
>   mapred.tasktracker.reduce.tasks.maximum: 1.75 * nodes
>
> or
>
>   mapred.tasktracker.map.tasks.maximum = 10 * #slaves
>   mapred.tasktracker.reduce.tasks.maximum = 2 * #slave processors
>
> Thanks
>
>
>
> 2010/11/17 Alex Baranau <alex.baranov.v@gmail.com>
>
> > How many nodes do you use for your "fully distributed" cluster?
> >
> > Alex Baranau
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
> >
> > On Wed, Nov 17, 2010 at 5:44 AM, Cornelio Iñigo
> > <cornelio.inigof@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I have a question for you:
> > >
> > > I developed a program using Hadoop. It has one map function and one
> > > reduce function (like WordCount), and in the map function I do all the
> > > processing of my data.
> > > When I run this program on a single-node machine it takes about 7
> > > minutes (it's a small dataset), and on a pseudo-distributed machine it
> > > takes about 7 minutes too, but when I run it on a fully distributed
> > > cluster (12 nodes) it takes much longer, like an hour!!
> > >
> > > I tried changing the mapred.tasktracker.map.tasks.maximum and
> > > mapred.tasktracker.reduce.tasks.maximum variables (2 and 2 like the
> > > default, 10 and 2, 2 and 10, 5 and 5) and the results are the same.
> > > Am I missing something?
> > > Is this a cluster configuration issue or is it in my program?
> > >
> > > Thanks
> > >
> > > --
> > > *Cornelio*
> > >
> >
>
>
>
> --
> *Cornelio*
>
