hadoop-general mailing list archives

From Eric Sammer <esam...@cloudera.com>
Subject Re: Problem with Hadoop Streaming and -D mapred.tasktracker.map.tasks.maximum option
Date Tue, 11 May 2010 16:59:58 GMT
The short answer is that with Hadoop, you generally do not decide the
exact number of map tasks that are spawned. The number of map tasks is
usually a function of the number of blocks in the input data set. Task
trackers are configured with a number of slots for map and reduce
tasks, and tasks are assigned to those slots; by default, each task
tracker has 2 map slots and 2 reduce slots.
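
For reference, those slot counts are a cluster-side setting rather than a
job-level one: mapred.tasktracker.map.tasks.maximum is read from each task
tracker's conf/mapred-site.xml when the task tracker starts, so passing it
with -D at job submission time does not change a running cluster. A minimal
sketch of the relevant entries (the values shown are illustrative):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>   <!-- concurrent map tasks allowed on this node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>   <!-- concurrent reduce tasks allowed on this node -->
  </property>

The task trackers have to be restarted to pick up a change.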

The manner in which Hadoop assigns tasks to task trackers is based on a
number of factors, chiefly data locality (the scheduler prefers to run a
map task on a node that holds a copy of its input block) and which slots
happen to be free when a task tracker heartbeats in.

You can attempt to control parallelization at this micro level (as you're
doing), but it's generally a bad idea: you give up full use of your
cluster, and you give up the scheduling that Hadoop is actually good at.
In fact, it may not be possible to control it exactly as you wish. Is
there a reason why you need to control things so strictly? Do you need
exactly a multiple of the number of nodes, or an approximation thereof?
What is the rationale for wanting to run only one task per node? (If an
exact total number of maps, rather than per-node placement, is what you
are after, see the sketch below.)
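
For what it's worth: if what you actually need is "exactly N map tasks in
total," streaming jobs can usually get that by putting one work item per
input line and using NLineInputFormat, which produces one split (and
therefore one map task) per line. A rough sketch, reusing your paths and
assuming input/test553short holds one task description per line:

  /opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -D mapred.line.input.format.linespermap=1 \
    -D mapred.reduce.tasks=0 \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -input input/test553short \
    -output out1 \
    -mapper "/opt/jobdata/script_1k"

One caveat: with input formats other than the default TextInputFormat,
streaming may pass the key (here, the byte offset) tab-separated in front
of each line, so the script may need to strip it.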

On Mon, May 10, 2010 at 10:07 AM, Corneliu-Tudor Vlad
<corneliutudor.vlad@ens-lyon.fr> wrote:
>
> Hello
>
> I am a new user of Hadoop and I have some trouble using Hadoop Streaming and
> the "-D mapred.tasktracker.map.tasks.maximum" option.
>
> I'm experimenting with an unmanaged application (C++) which I want to run
> over several nodes in 2 scenarios:
> 1) the number of maps (input splits) is equal to the number of nodes
> 2) the number of maps is a multiple of the number of nodes (5, 10, 20, ...)
> Initially, when running the tests in scenario 1 I would sometimes get 2
> processes/node on half the nodes. However, I fixed this by adding the
> directive -D mapred.tasktracker.map.tasks.maximum=1, so everything works fine.
>
> In the case of scenario 2 (more maps than nodes) this directive no longer
> works; I always get 2 processes/node. I tested even with maximum=5 and I
> still get 2 processes/node.
>
> The entire command I use is:
>
> /usr/bin/time --format="-duration:\t%e |\t-MFaults:\t%F |\t-ContxtSwitch:\t%w" \
>   /opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
>   -D mapred.tasktracker.map.tasks.maximum=1 \
>   -D mapred.map.tasks=30 \
>   -D mapred.reduce.tasks=0 \
>   -D io.file.buffer.size=5242880 \
>   -libjars "/opt/hadoop/contrib/streaming/hadoop-7debug.jar" \
>   -input input/test553short \
>   -output out1 \
>   -mapper "/opt/jobdata/script_1k" \
>   -inputformat "me.MyInputFormat"
>
> I'm using Debian Lenny x64 and Hadoop 0.20.2.
>
> My question is: why is this happening, and how can I make it work properly
> (i.e., be able to limit exactly how many mappers run at one time per node)?
>
> Thank you in advance,
> T
>
>



-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
