From: Harsh J
Date: Mon, 28 Jan 2013 21:32:51 +0530
Subject: Re: number of mapper tasks
Reply-To: user@hadoop.apache.org

I'm unfamiliar with EMR myself (perhaps the question fits EMR's own
boards), but here's my take anyway:

On Mon, Jan 28, 2013 at 9:24 PM, Marcelo Elias Del Valle wrote:
> Hello,
>
> I am using hadoop with TextInputFormat, a mapper and no reducers. I am
> running my jobs at Amazon EMR. When I run my job, I set both following
> options:
> -s,mapred.tasktracker.map.tasks.maximum=10
> -jobconf,mapred.map.tasks=10

The first property you've given refers to a single TaskTracker's maximum
concurrency. This means that if you have 4 TaskTrackers, each with this
property set to 10, then you have 40 concurrent map slots available in
total, perhaps more than you intended to configure? Again, this may be
EMR-specific and I may be wrong, since I haven't seen anyone pass this
via the CLI before; it is generally configured at the service level.

The second property has more to do with your problem. MR typically
decides the number of map tasks a job requires based on the input size.
In the stable API (the org.apache.hadoop.mapred one), mapred.map.tasks
can be passed the way you seem to be passing it above, and the input
format takes it as a 'hint' when deciding how many map splits to carve
out of the input, even if the input isn't large enough to necessitate
that many maps.
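To make the 'hint' behaviour concrete, here is a rough Python sketch of the
split-size arithmetic that the old-API FileInputFormat is generally understood
to use for a single input file. The function name old_api_split_count and its
parameters are hypothetical, and the model ignores multi-file inputs,
compression and zero-length files, so treat it as an illustration rather than
a reimplementation of Hadoop's code.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # assumed tolerance: the last split may run up to 10% over


def old_api_split_count(file_size, block_size, map_tasks_hint, min_split_size=1):
    """Estimate the number of splits for one file under the old mapred API.

    Hypothetical helper: models splitSize = max(minSize, min(goalSize, blockSize)),
    where goalSize is the input size divided by the mapred.map.tasks hint.
    """
    # The hint only sets a goal size; it is not a hard task count.
    goal_size = file_size // max(map_tasks_hint, 1)
    split_size = max(min_split_size, min(goal_size, block_size))

    splits = 0
    remaining = file_size
    # Carve off full-size splits while more than ~1.1 splits' worth remains.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    # A 100 MB file on 64 MB blocks, with and without a hint of 10 maps.
    print(old_api_split_count(100 * MB, 64 * MB, map_tasks_hint=10))
    print(old_api_split_count(100 * MB, 64 * MB, map_tasks_hint=1))
```

Under these assumptions, a 100 MB file on 64 MB blocks yields 10 splits with a
hint of 10 but only 2 with a hint of 1, which shows why a small input can
still produce many maps under the old API when the hint is honoured.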
However, the new API code accepts no such config-based hints (and such
logic changes need to be done in the program's own code). So depending
on how your job is implemented, the setting may or may not take effect.
Hope this helps.

> When I run my job with just 1 instance, I see it only creates 1 mapper.
> When I run my job with 5 instances (1 master and 4 cores), I can see only 2
> mapper slots are used and 6 stay open.

Perhaps the job itself launched with 2 total map tasks? You can check
this on the JobTracker UI or whatever EMR offers as a job viewer.

> I am trying to figure why I am not being able to run more mappers in
> parallel. When I see the logs, I find some messages like these:
>
> INFO org.apache.hadoop.mapred.ReduceTask (main):
> attempt_201301281437_0001_r_000003_0 Scheduled 0 outputs (0 slow hosts and0
> dup hosts)
> org.apache.hadoop.mapred.ReduceTask (main):
> attempt_201301281437_0001_r_000003_0 Need another 1 map output(s) where 0 is
> already in progress

This is a typical log from a waiting reduce task. What are you asking
here specifically?

--
Harsh J