hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: number of mapper tasks
Date Mon, 28 Jan 2013 16:41:02 GMT
Hi again,


On Mon, Jan 28, 2013 at 10:01 PM, Marcelo Elias Del Valle
<mvallebr@gmail.com> wrote:
> Hello Harsh,
>     First of all, thanks for the answer!
> 2013/1/28 Harsh J <harsh@cloudera.com>
>> So depending on your implementation of the job here, you may or may
>> not see it act in effect. Hope this helps.
> Is there anything I can do in my job, my code or in my inputFormat so that
> hadoop would choose to run more mappers? My text file has 10 million lines
> and each mapper task processes 1 line at a time, very quickly. I would like to
> have 40 threads in parallel or even more processing those lines.

This seems CPU-oriented. You probably want NLineInputFormat. It should
let you spawn more maps as well, based on your N factor.
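
As a rough illustration of how the N factor translates into map tasks, here
is a minimal sketch in plain Java. It is not Hadoop code: the class and method
names are hypothetical, and it assumes a single input file where every split
holds at most N lines. In the newer MapReduce API the value of N is typically
set with NLineInputFormat.setNumLinesPerSplit(job, n).

```java
// Back-of-the-envelope sketch, not Hadoop code: estimates how many map
// tasks an N-lines-per-split input format would produce for one file.
// SplitEstimate and numSplits are hypothetical names used for illustration.
final class SplitEstimate {
    private SplitEstimate() {}

    /** Number of splits when each split holds at most linesPerSplit lines. */
    static long numSplits(long totalLines, int linesPerSplit) {
        if (totalLines < 0 || linesPerSplit <= 0) {
            throw new IllegalArgumentException(
                "totalLines must be >= 0 and linesPerSplit must be > 0");
        }
        // Ceiling division: a partial trailing split still needs its own map task.
        return (totalLines + linesPerSplit - 1) / linesPerSplit;
    }
}
```

Under these assumptions, a 10-million-line file with N = 250,000 would yield
about 40 map tasks, which lines up with the "40 in parallel" goal, provided
the cluster has that many map slots.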

>> >     When I run my job with just 1 instance, I see it only creates 1 mapper.
>> > When I run my job with 5 instances (1 master and 4 cores), I can see only 2
>> > mapper slots are used and 6 stay open.
>> Perhaps the job itself launched with 2 total map tasks? You can check
>> this on the JobTracker UI or whatever EMR offers as a job viewer.
> I am trying to figure this out. Here is what I have from EMR:
> http://mvalle.com/downloads/hadoop_monitor.png
> I will try to get their support to understand this, but I didn't understand
> what you said about the job being launched with 2 total map tasks... if I
> have 8 slots, shouldn't all of them be filled always?

Not really - "Slots" are capacities, rather than split factors
themselves. You can have N slots always available, but your job has to
supply as many map tasks (based on its input/needs/etc.) to use them.
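
The relationship between slots and map tasks can be sketched as below. This
is a deliberately simplified model in plain Java with hypothetical names. It
assumes a single job on the cluster and ignores scheduler details.

```java
// Simplified model, not Hadoop code: slots are a capacity, and a job can
// only occupy as many of them as it has map tasks to run.
// SlotUsage and its methods are hypothetical names used for illustration.
final class SlotUsage {
    private SlotUsage() {}

    /** Map tasks that can run at once: limited by both slots and tasks. */
    static int busySlots(int mapSlots, int mapTasks) {
        return Math.min(mapSlots, mapTasks);
    }

    /** Slots left unused because the job did not supply enough map tasks. */
    static int idleSlots(int mapSlots, int mapTasks) {
        return mapSlots - busySlots(mapSlots, mapTasks);
    }
}
```

With 8 slots and a job that has only 2 map tasks, this model gives 2 busy
slots and 6 idle ones, which matches what was observed on the 5-instance
cluster.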

>> This is a typical waiting reduce task log, what are you asking here
>> specifically?
> I have no reduce tasks. My map does the job without putting anything in the
> output. Is it happening because reduce tasks receive nothing as input?

Unless your job sets the number of reducers to 0 manually, 1 default
reducer is always run, and it waits to see if it gets any outputs from
the maps. If it does not receive any outputs after the maps have all
completed, it exits with behavior equivalent to a NOP.

Hope this helps!

Harsh J
