Message-ID: <486C8596.1010109@yahoo-inc.com>
Date: Thu, 03 Jul 2008 13:23:58 +0530
From: Amar Kamat
To: core-user@hadoop.apache.org
Subject: Re: scaling issue, please help
References: <560DF7D2-5A04-4308-8B9E-4C5F4319788C@apple.com>
 <486B0CD4.7020602@yahoo-inc.com> <7FC5A2B6-BDF8-47A7-AA4A-ADB8F1013844@apple.com>
In-Reply-To: <7FC5A2B6-BDF8-47A7-AA4A-ADB8F1013844@apple.com>

Mori Bellamy wrote:
> i discovered that some of my code was causing out of bounds
> exceptions. i cleaned up that code and the map tasks seemed to work.
> that confuses me -- i'm pretty sure hadoop is resilient to a few map
> tasks failing (5 out of 13k). before this fix, my remaining 2% of
> tasks were getting killed.

Mori,
I am not sure what the confusion is. Hadoop is resilient to a few task
failures, but not by default. The parameters that control this are
mapred.max.map.failures.percent and mapred.max.reduce.failures.percent.

Every task internally consists of attempts (a notion internal to the
framework). Hadoop tolerates some attempt failures too. If the number of
failed attempts of a task exceeds the threshold
(mapred.map.max.attempts / mapred.reduce.max.attempts; default is 4),
the task is considered failed. If the number of failed map/reduce tasks
exceeds the threshold (mapred.max.map.failures.percent /
mapred.max.reduce.failures.percent; default is 0), the job is
considered failed. So with the defaults, even a single failed task
fails the job.

Amar
>
>
> On Jul 1, 2008, at 10:06 PM, Amar Kamat wrote:
>
>> Mori Bellamy wrote:
>>> hey all,
>>> i've got a mapreduce task that works on small (~1G) input. when i
>>> try to run the same task on large (~100G) input, i get the following
>>> error around when the map tasks are almost done (~98%)
>>>
>>> 2008-07-01 13:10:59,231 INFO org.apache.hadoop.mapred.ReduceTask:
>>> task_200807011005_0005_r_000000_0: Got 0 new map-outputs & 0
>>> obsolete map-outputs from tasktracker and 0 map-outputs from
>>> previous failures
>>> 2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask:
>>> task_200807011005_0005_r_000000_0 Got 0 known map output
>>> location(s); scheduling...
>>> 2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask:
>>> task_200807011005_0005_r_000000_0 Scheduled 0 of 0 known outputs (0
>>> slow hosts and 0 dup hosts)
>>> 2008-07-01 13:10:59,232 INFO org.apache.hadoop.mapred.ReduceTask:
>>> task_200807011005_0005_r_000000_0 Need 1 map output(s)
>> ...
>> ...
>> These are not error messages. The reducers are stuck as not all maps
>> are completed. Mori, could you let us know what is happening to the
>> other 2% maps. Are they getting executed? Are they still pending
>> (waiting to run)? Were they killed/failed? Is there any lost tracker?
>>> I'm running the task on a cluster of 5 workers, one DFS master, and
>>> one task tracker.
>> What do you mean by 5 workers and 1 task tracker?
>>> i'm chaining mapreduce tasks, so i'm using SequenceFileOutput and
>>> SequenceFileInput. this error happens before the first link in the
>>> chain successfully reduces.
>> Can you elaborate this a bit. Are you chaining MR jobs?
>> Amar
>>>
>>> does anyone have any insight? thanks!
>>
>
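
Note on the two thresholds discussed in the reply at the top of this message: the rule can be illustrated with a small toy model. This is only a sketch in Python, not Hadoop code. The function names task_failed and job_failed are hypothetical, and the exact boundary behaviour (a task is given up on once its failed attempts reach the limit; a job fails once the failed-task share exceeds the allowed percentage) is an assumption about how the comparison is made.

```python
def task_failed(failed_attempts, max_attempts=4):
    """Toy model of mapred.map.max.attempts / mapred.reduce.max.attempts.

    Assumption: the task is declared failed once the number of failed
    attempts reaches max_attempts (default 4, as stated in the reply).
    """
    return failed_attempts >= max_attempts


def job_failed(failed_tasks, total_tasks, max_failures_percent=0):
    """Toy model of mapred.max.map.failures.percent (and the reduce variant).

    Assumption: the job is declared failed once the percentage of failed
    tasks exceeds max_failures_percent (default 0, as stated in the reply).
    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return failed_tasks * 100 > max_failures_percent * total_tasks


# The situation from the thread: 5 failed map tasks out of roughly 13k.
default_outcome = job_failed(failed_tasks=5, total_tasks=13000)
tolerant_outcome = job_failed(failed_tasks=5, total_tasks=13000,
                              max_failures_percent=1)
```

Under these assumptions, default_outcome is True (any failed task fails the job when the percentage is 0), while tolerant_outcome is False, since 5 of 13000 tasks is well under 1%.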