From: Rohith Sharma K S
To: "yarn-dev@hadoop.apache.org"
CC: "mapreduce-dev@hadoop.apache.org", "mapreduce-user@hadoop.apache.org", "yarn-user@hadoop.apache.org"
Subject: RE: Reducers are launched after jobClient is exited.
Date: Thu, 30 Jan 2014 02:47:46 +0000
Message-ID: <0EE80F6F7A98A64EBD18F2BE839C9115275A5C35@szxeml512-mbx.china.huawei.com>
References: <0EE80F6F7A98A64EBD18F2BE839C9115275A171D@szxeml512-mbx.china.huawei.com>

Thank you, Vinod, for your reply.

I used FileOutputCommitter (the default) to commit files to the job output directory. I noticed that no commit or abort happens for the reducer task when the reducer is killed by the NodeManager (stopContainer request). To reproduce this, I manually killed the reducer task (with both "kill" and "kill -9") and ended up with the same issue.

I walked through the YarnChild class and found that no shutdown hook is registered.
Why is there no shutdown hook for YarnChild? Is this intentional?

-----Original Message-----
From: Vinod Kumar Vavilapalli [mailto:vinodkv@hortonworks.com]
Sent: 29 January 2014 12:17
To: yarn-dev@hadoop.apache.org
Cc: mapreduce-dev@hadoop.apache.org; mapreduce-user@hadoop.apache.org; yarn-user@hadoop.apache.org
Subject: Re: Reducers are launched after jobClient is exited.

The MapReduce AppMaster, and YARN at large, use asynchronous event handling inside the JVM, so you may run into race conditions like this.

Even otherwise, doing this in a deterministic manner is better achieved by overriding your OutputCommitter. Job output commit/abort happens only once.
+Vinod

On Jan 28, 2014, at 7:06 PM, Rohith Sharma K S wrote:

> Hi All,
>
> I ran a job with 1 map and 1 reducer
> (mapreduce.job.reduce.slowstart.completedmaps=1). The map failed
> (because of an error in the Mapper implementation), but reducers were
> still launched by the ApplicationMaster. These reducers were killed by
> the ApplicationMaster while stopping the RMCommunicator service.
>
> 1. Why are reducers launched after the job has finished? (Is this a bug in MR?)
>
> Our use case: when the job finishes (succeeded/failed), the client program
> deletes the job output directory. Here, the job client exits immediately
> after the job status is set (in the log below, at 2014-01-23 07:34:43,166).
>
> But, as the log below shows, reducers are launched later, and the reducer
> temporary directory and files (_temporary) are created. These files are
> left in HDFS undeleted forever.
>
> Kindly share your thoughts on how we can handle this situation.
>
> 2014-01-23 07:34:43,151 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1389970937094_0047_m_000000 Task Transitioned from RUNNING to FAILED
> 2014-01-23 07:34:43,151 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 1
> 2014-01-23 07:34:43,151 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Job failed as tasks failed. failedMaps:1 failedReduces:0
> 2014-01-23 07:34:43,153 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1389970937094_0047Job Transitioned from RUNNING to FAIL_ABORT
> 2014-01-23 07:34:43,153 INFO [CommitterEvent Processor #0] org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler: Processing the event EventType: JOB_ABORT
> 2014-01-23 07:34:43,166 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1389970937094_0047Job Transitioned from FAIL_ABORT to FAILED
> ...............
> ...............
> 2014-01-23 07:34:43,707 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:1 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:1 AssignedReds:0 CompletedMaps:1 CompletedReds:0 ContAlloc:4 ContRel:0 HostLocal:1 RackLocal:0
> 2014-01-23 07:34:43,709 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=12288
> 2014-01-23 07:34:43,709 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce slow start threshold reached. Scheduling reduces.
> 2014-01-23 07:34:43,709 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: All maps assigned. Ramping up all remaining reduces:1
> ...............
> ...............
> 2014-01-23 07:34:45,714 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated containers 1
> 2014-01-23 07:34:45,714 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
> 2014-01-23 07:34:45,714 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container container_1389970937094_0047_01_000006 to attempt_1389970937094_0047_r_000000_0
> ...............
> ...............
> 2014-01-23 07:34:45,724 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1389970937094_0047_r_000000_0 TaskAttempt Transitioned from UNASSIGNED to ASSIGNED
> 2014-01-23 07:34:45,725 INFO [ContainerLauncher #8] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Processing the event EventType: CONTAINER_REMOTE_LAUNCH for container container_1389970937094_0047_01_000006 taskAttempt attempt_1389970937094_0047_r_000000_0
> 2014-01-23 07:34:45,725 INFO [ContainerLauncher #8] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Launching attempt_1389970937094_0047_r_000000_0
> 2014-01-23 07:34:45,727 INFO [ContainerLauncher #8] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Shuffle port returned by ContainerManager for attempt_1389970937094_0047_r_000000_0 : 11234
> 2014-01-23 07:34:45,728 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt: [attempt_1389970937094_0047_r_000000_0] using containerId: [container_1389970937094_0047_01_000006 on NM: [linux85:11232]
> 2014-01-23 07:34:45,728 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1389970937094_0047_r_000000_0 TaskAttempt Transitioned from ASSIGNED to RUNNING
> 2014-01-23 07:34:45,728 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1389970937094_0047_r_000000 Task Transitioned from SCHEDULED to RUNNING
> ...............
> .............
> 2014-01-23 07:34:48,178 INFO [Thread-59] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING attempt_1389970937094_0047_r_000000_0
> 2014-01-23 07:34:48,180 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1389970937094_0047_r_000000_0 TaskAttempt Transitioned from RUNNING to KILLED
> ...............
> .............
>
> Thanks & Regards
> Rohith
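
Regarding the shutdown hook question raised in the reply at the top of this thread, here is a minimal sketch of what registering one looks like in plain Java. It is not YarnChild or Hadoop code: the class TempDirCleanup and its methods are hypothetical names used only for illustration, and it cleans up a local directory rather than HDFS. A JVM normally runs its shutdown hooks on regular exit and on SIGTERM ("kill"), but SIGKILL ("kill -9") cannot be intercepted by the process, so no hook can cover that case.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of Hadoop or YarnChild. */
public final class TempDirCleanup {

    private TempDirCleanup() {}

    /** Deletes dir and everything under it; a missing dir is a no-op. */
    public static void deleteRecursively(Path dir) {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so children are deleted before their parents.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Registers a JVM shutdown hook that removes dir; returns the hook thread. */
    public static Thread registerCleanupHook(Path dir) {
        Thread hook = new Thread(() -> deleteRecursively(dir), "temp-dir-cleanup");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) throws Exception {
        // Local stand-in for a task's temporary output directory.
        Path dir = Files.createTempDirectory("_temporary");
        Files.createFile(dir.resolve("part-r-00000"));
        registerCleanupHook(dir);
        System.out.println("Created " + dir);
        // Optional first argument: seconds to stay alive, so the process can be
        // signalled from another shell to compare "kill" with "kill -9".
        long seconds = args.length > 0 ? Long.parseLong(args[0]) : 0;
        Thread.sleep(seconds * 1000);
    }
}
```

Started with a sleep argument (for example 60) and then stopped with "kill", the process should remove the printed directory before exiting; stopped with "kill -9", the directory stays behind. This is why a hook alone would not cover every case described in this thread.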