hadoop-general mailing list archives

From Corneliu-Tudor Vlad <corneliutudor.v...@ens-lyon.fr>
Subject Re: Problem with Hadoop Streaming and -D mapred.tasktracker.map.tasks.maximum option
Date Tue, 11 May 2010 20:26:50 GMT

Thank you for the answer; it is a little clearer now. If you could
point me to some additional reading, I would be grateful.

In the meantime I created an issue on Jira [ MAPREDUCE-1781 ], where I
both received an answer and explained my objective in more detail.

As I understand from both your answer and Hemanth's on Jira, I am
using Hadoop in a non-standard way. The reason is that I am testing
the feasibility of using Hadoop as a parallelization framework for a
highly CPU- and memory-bound application, but only from the point of
view of distributed computing, not multicore. That is why I want only
1 process at a time per node.

Additionally, I will test it on a heterogeneous datacenter, possibly
with both dual-core and quad-core machines, so even if I use 2 mappers
at once, I won't fully use the power of the cluster (from what I understand).
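
To make the slot arithmetic concrete, here is a rough sketch in Python. It assumes an idealized scheduler in which every map slot is kept busy and all map tasks take equally long, which is not guaranteed in practice. The function name map_waves and the 6-node cluster are hypothetical; only the figure of 30 map tasks comes from the command quoted further down.

```python
import math

def map_waves(num_maps, slots_per_node):
    """Return (concurrent_capacity, waves) for a job with num_maps map tasks.

    slots_per_node lists the number of map slots configured on each
    task tracker, so heterogeneous nodes can carry different values.
    """
    capacity = sum(slots_per_node)
    if capacity <= 0:
        raise ValueError("cluster needs at least one map slot")
    # Each wave fills every slot once; the last wave may be partial.
    return capacity, math.ceil(num_maps / capacity)

# Hypothetical 6-node cluster running 30 map tasks.
one_slot = map_waves(30, [1] * 6)   # one process per node
two_slots = map_waves(30, [2] * 6)  # 2 map slots per node (the default)
print(one_slot, two_slots)
```

With one slot per node the job needs more waves, but each node runs a single process at a time, which is the behaviour wanted here.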

From what I tested today, my intended approach works when the
tasks.maximum option is set in the config file at startup, rather
than passed with -D on the command line.
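
For reference, a minimal sketch of what such a setting might look like. It assumes the file in question is conf/mapred-site.xml on each task tracker node (the message only says "the config file", so the file name and location are an assumption) and that the task tracker is restarted after the change.

```xml
<?xml version="1.0"?>
<!-- conf/mapred-site.xml on each task tracker node (assumed location) -->
<configuration>
  <property>
    <!-- Maximum number of map tasks this task tracker runs at once -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>
</configuration>
```

Since each task tracker reads this value from its own configuration, nodes with different core counts could in principle be given different values.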

Thank you,

Quoting Eric Sammer <esammer@cloudera.com>:

> The short answer is that with Hadoop, you generally do not decide the
> exact number of map tasks that are spawned. The number of map tasks
> spawned is usually a function of the number of blocks in the input
> data set. Task trackers are configured with a number of slots for map
> and reduce tasks. Tasks are assigned to slots on task trackers. By
> default, task trackers have 2 map slots and 2 reduce slots per task
> tracker.
> The manner in which Hadoop assigns tasks to task trackers is based
> on a number of factors.
> You can attempt to control parallelization at a micro level (as you're
> doing) but it's generally a bad idea. Not only are you not taking full
> advantage of your cluster, but you are not taking advantage of what
> Hadoop is actually good at. In fact, it may not be possible to control
> it exactly as you wish. Is there a reason why you need to control
> things so strictly? Do you need exactly a multiple of the number of
> nodes, or an approximation thereof? What is the rationale for wanting
> to run only one task per node?
> On Mon, May 10, 2010 at 10:07 AM, Corneliu-Tudor Vlad
> <corneliutudor.vlad@ens-lyon.fr> wrote:
>> Hello
>> I am a new user of Hadoop and I have some trouble using Hadoop Streaming and
>> the "-D mapred.tasktracker.map.tasks.maximum" option.
>> I'm experimenting with an unmanaged application (C++) which I want to run
>> over several nodes in 2 scenarios:
>> 1) the number of maps (input splits) is equal to the number of nodes
>> 2) the number of maps is a multiple of the number of nodes (5, 10, 20, ...)
>> Initially, when running the tests in scenario 1, I would sometimes get 2
>> processes/node on half the nodes. However, I fixed this by adding the directive
>> -D mapred.tasktracker.map.tasks.maximum=1, so everything works fine.
>> In the case of scenario 2 (more maps than nodes) this directive no longer
>> works, and I always get 2 processes/node. I even tested with
>> maximum=5 and I still get 2 processes/node.
>> The entire command I use is:
>> /usr/bin/time --format="-duration:\t%e |\t-MFaults:\t%F |\t-ContxtSwitch:\t%w" \
>>  /opt/hadoop/bin/hadoop jar /opt/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
>>  -D mapred.tasktracker.map.tasks.maximum=1 \
>>  -D mapred.map.tasks=30 \
>>  -D mapred.reduce.tasks=0 \
>>  -D io.file.buffer.size=5242880 \
>>  -libjars "/opt/hadoop/contrib/streaming/hadoop-7debug.jar" \
>>  -input input/test553short \
>>  -output out1 \
>>  -mapper "/opt/jobdata/script_1k" \
>>  -inputformat "me.MyInputFormat"
>> I'm using Debian Lenny x64 and Hadoop 0.20.2.
>> My question is: why is this happening, and how can I make it work properly
>> (i.e. be able to limit exactly how many mappers I can have at one time per
>> node)?
>> Thank you in advance,
>> T
> --
> Eric Sammer
> phone: +1-917-287-2675
> twitter: esammer
> data: www.cloudera.com
