From: Apache Wiki
To: Apache Wiki
Date: Tue, 10 May 2011 07:11:32 -0000
Message-ID: <20110510071132.86693.84002@eos.apache.org>
Subject: [Hadoop Wiki] Update of "NextGenMapReduceDevTesting" by Arun C Murthy

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "NextGenMapReduceDevTesting" page has been changed by Arun C Murthy.
The comment on this change is: v1.
http://wiki.apache.org/hadoop/NextGenMapReduceDevTesting

--------------------------------------------------

New page:
This wiki tracks developer-testing for NextGenMapReduce.

The aim of this document is to capture various failure handling scenarios for !MapReduce applications running under YARN and the YARN framework itself.

=== Failure scenarios ===

==== User task error ====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || || ||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs task-attempt before other 'virgin' tasks on a _different node_ || || ||

==== User task error, same task fails 4 times ====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || || ||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs task-attempt before other 'virgin' tasks on a _different node_ || || ||
|| AM fails the !MapReduce job and exits || || ||

==== Container failure ====

===== Localization error =====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || || ||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs task-attempt before other 'virgin' tasks on a _different node_ || || ||

===== Exceeding memory or disk limits =====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || || ||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs task-attempt before other 'virgin' tasks on a _different node_ || || ||

===== Lost map output or faulty NM Netty =====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| Reduces report shuffle failure errors to AM || || ||
|| On sufficient fetch-failure notifications the AM re-runs map || || ||

===== User fails/kills map or reduce task =====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM is immediately notified of error by NM with appropriate error code/status-msg || || ||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| RM notifies AM about status (including error code) of the container || || ||
|| AM fails the task attempt || || ||
|| AM re-runs task-attempt before other 'virgin' tasks on a _different node_ || || ||

==== Node failure due to timeout or health-check error ====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM fails all running containers and informs appropriate AMs || || ||
|| Shuffle failures for completed map containers... handled (aggressively?) by AM || || ||
|| AM re-runs running task-attempts and completed maps || || ||

==== !MapReduce AM failure ====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| NM notifies RM || || ||
|| !CapacityScheduler releases resources for queue, user and application || || ||
|| ASM recognises AM failure || || ||
|| ASM kills all running containers || || ||
|| ASM restarts !MapReduce AM || || ||
|| !MapReduce AM recovers and re-runs only non-complete tasks || || ||

==== !ResourceManager bounce ====

|| '''Corrective measures''' || '''Developer(s) verifying the corrective measures''' || '''Date(s)''' ||
|| RM recovers all running AMs || || ||
|| RM recovers all running containers || || ||
|| RM rebuilds !CapacityScheduler queue & user capacities || || ||
|| !MapReduce AMs re-run only non-complete tasks || || ||
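
The AM-side retry behaviour listed under "User task error" and "User task error, same task fails 4 times" can be sketched as a small toy model, which may help when writing the verification checks. This is an illustrative sketch only, not YARN or !MapReduce AM code: `ToyAppMaster`, `Task`, `MAX_ATTEMPTS` and the node names are hypothetical, and the limit of 4 is taken from the scenario title rather than from any real configuration key.

```python
from collections import deque
from dataclasses import dataclass, field

# Assumption: taken from the "same task fails 4 times" scenario title.
MAX_ATTEMPTS = 4


@dataclass
class Task:
    task_id: str
    failures: int = 0
    failed_nodes: set = field(default_factory=set)  # nodes where an attempt failed


class ToyAppMaster:
    """Hypothetical stand-in for the AM's retry bookkeeping."""

    def __init__(self, task_ids, nodes):
        self.nodes = list(nodes)
        self.pending = deque(Task(t) for t in task_ids)
        self.job_failed = False

    def on_container_failed(self, task, node):
        """Models the RM notifying the AM that a task's container failed."""
        task.failures += 1              # AM fails the task attempt
        task.failed_nodes.add(node)
        if task.failures >= MAX_ATTEMPTS:
            self.job_failed = True      # AM fails the job and exits
            self.pending.clear()
        else:
            # Re-run ahead of tasks that have never been attempted.
            self.pending.appendleft(task)

    def next_assignment(self):
        """Returns (task, node) for the next attempt, or None if nothing to run."""
        if self.job_failed or not self.pending:
            return None
        task = self.pending.popleft()
        # Prefer a node on which this task has not failed yet.
        candidates = [n for n in self.nodes if n not in task.failed_nodes]
        node = candidates[0] if candidates else self.nodes[0]
        return task, node


am = ToyAppMaster(["m0", "m1"], ["node-a", "node-b"])
task, node = am.next_assignment()          # first attempt of m0
am.on_container_failed(task, node)
retry, retry_node = am.next_assignment()   # m0 again, ahead of m1, on another node
```

In this sketch the failed task is pushed to the front of the pending queue so that it is retried before never-attempted tasks, and it falls back to an already-used node only when every node has seen a failure for that task. How the real AM chooses nodes is not specified on this page.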