From: Harsh J <harsh@cloudera.com>
Date: Mon, 28 Jan 2013 21:32:51 +0530
Message-ID: 
 <CAOcnVr1sY13FTFb3u4M9T2pJ99-qbUvBj06jbsWiGvn-jDm2Bg@mail.gmail.com>
Subject: Re: number of mapper tasks
To: "<user@hadoop.apache.org>" <user@hadoop.apache.org>
Content-Type: text/plain; charset=ISO-8859-1

I'm unfamiliar with EMR myself (perhaps the question fits EMR's own
boards) but here's my take anyway:

On Mon, Jan 28, 2013 at 9:24 PM, Marcelo Elias Del Valle
<mvallebr@gmail.com> wrote:
> Hello,
>
>     I am using hadoop with TextInputFormat, a mapper and no reducers. I am
> running my jobs at Amazon EMR. When I run my job, I set both following
> options:
> -s,mapred.tasktracker.map.tasks.maximum=10
> -jobconf,mapred.map.tasks=10

The first property you've given refers to a single TaskTracker's
maximum concurrency. This means that if you have 4 TaskTrackers, each
with this property set to 10, you have 40 concurrent map slots
available in total - perhaps more than you intended to configure?
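
To make the arithmetic concrete, here is a rough sketch. The function
names are made up for illustration and are not part of Hadoop; the
point is only that slots cap concurrency and do not create map tasks.

```python
def total_map_slots(num_tasktrackers, slots_per_tracker):
    """Cluster-wide map slots when every TaskTracker has the same maximum."""
    return num_tasktrackers * slots_per_tracker


def concurrent_maps(num_map_tasks, num_tasktrackers, slots_per_tracker):
    """Maps that can run at once: limited by both the task count and the slots."""
    return min(num_map_tasks, total_map_slots(num_tasktrackers, slots_per_tracker))


if __name__ == "__main__":
    # 4 TaskTrackers with mapred.tasktracker.map.tasks.maximum=10
    print(total_map_slots(4, 10))
    # A job that was split into only 2 map tasks still runs just 2 maps
    print(concurrent_maps(2, 4, 10))
```

So a job that was split into only 2 map tasks will occupy 2 slots no
matter how many slots are free.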

Again, this may be EMR-specific and I may be wrong, since I haven't
seen anyone pass this via the CLI before; it is generally configured
at the service level.

The second property has more to do with your problem. MR typically
decides the number of map tasks a job requires based on the input
size. In the stable API (the org.apache.hadoop.mapred one),
mapred.map.tasks can be passed the way you seem to be passing it
above, and the input format takes it as a 'hint' for how many map
splits to carve out of the input, even if the input isn't large
enough to necessitate that many maps.
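
As a rough illustration of how that hint feeds into the split count,
here is a simplified Python model of the split-size arithmetic used by
the old API's FileInputFormat. It is a sketch, not the real code: it
assumes one splittable, non-empty file, and estimate_splits and its
parameter names are my own.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size


def estimate_splits(total_bytes, block_size, requested_maps, min_split_size=1):
    """Approximate the number of map splits for a single splittable file."""
    # The hint (mapred.map.tasks) sets a goal size per split.
    goal_size = total_bytes // max(requested_maps, 1)
    # A split never grows past one block and never shrinks below the minimum.
    split_size = max(min_split_size, min(goal_size, block_size))

    splits = 0
    remaining = total_bytes
    while remaining / split_size > SPLIT_SLOP:
        remaining -= split_size
        splits += 1
    if remaining > 0:
        splits += 1  # the tail becomes the final split
    return splits


if __name__ == "__main__":
    # Small input, hint of 10: the input is cut into 10 splits
    print(estimate_splits(20 * MB, 64 * MB, 10))
    # Same input, hint of 1: a single split
    print(estimate_splits(20 * MB, 64 * MB, 1))
    # Large input, hint of 2: the block size wins, one split per block
    print(estimate_splits(1024 * MB, 64 * MB, 2))
```

In this model the hint can raise the number of splits for a small
input, but it cannot push the count below one split per block.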

However, the new API code (the org.apache.hadoop.mapreduce one)
accepts no such config-based hints; any such logic has to be
implemented in the program's own code.
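
One way a program might implement that logic itself, sketched in
Python as plain arithmetic: derive a maximum split size from the
input size and the number of maps you want, which the driver could
then hand to the new API's FileInputFormat.setMaxInputSplitSize. The
helper names below are hypothetical, and the second function is only
my simplified reading of how the new API picks a split size.

```python
MB = 1024 * 1024


def max_split_size_for(total_bytes, desired_maps):
    """Largest split size that still yields about desired_maps splits."""
    # Ceiling division, so the splits cover the whole input.
    return max(1, -(-total_bytes // max(desired_maps, 1)))


def new_api_split_size(block_size, min_size, max_size):
    """Split size as bounded by the configured minimum, maximum and block size."""
    return max(min_size, min(max_size, block_size))


if __name__ == "__main__":
    # A 20 MB input and 10 desired maps: cap each split at 2 MB
    cap = max_split_size_for(20 * MB, 10)
    print(cap)
    # With a 64 MB block, the cap is what determines the split size
    print(new_api_split_size(64 * MB, 1, cap))
```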

So depending on how your job is implemented, you may or may not see
the setting take effect. Hope this helps.

>     When I run my job with just 1 instance, I see it only creates 1 mapper.
> When I run my job with 5 instances (1 master and 4 cores), I can see only 2
> mapper slots are used and 6 stay open.

Perhaps the job itself launched with 2 total map tasks? You can check
this on the JobTracker UI or whatever EMR offers as a job viewer.

>      I am trying to figure why I am not being able to run more mappers in
> parallel. When I see the logs, I find some messages like these:
>
> INFO org.apache.hadoop.mapred.ReduceTask (main):
> attempt_201301281437_0001_r_000003_0 Scheduled 0 outputs (0 slow hosts and0
> dup hosts)
> org.apache.hadoop.mapred.ReduceTask (main):
> attempt_201301281437_0001_r_000003_0 Need another 1 map output(s) where 0 is
> already in progress

This is typical log output from a reduce task waiting on map output.
What are you asking here specifically?

--
Harsh J