Date: Wed, 16 Jun 2021 07:14:00 +0000 (UTC)
From: "Bilwa S T (Jira)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Created] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

Bilwa S T created
MAPREDUCE-7353:
------------------------------------

             Summary: Mapreduce job fails when NM is stopped
                 Key: MAPREDUCE-7353
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Bilwa S T
            Assignee: Bilwa S T

The job fails because a map task fails due to too many fetch failures.

{code:java}
Line 48048: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e03_1622107691213_1054_01_000005 taskAttempt attempt_1622107691213_1054_m_000000_0 | ContainerLauncherImpl.java:394
Line 48053: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | KILLING attempt_1622107691213_1054_m_000000_0 | ContainerLauncherImpl.java:209
Line 58026: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | TaskAttempt killed because it ran on unusable node node-group-1ZYEq0002:26009. AttemptId:attempt_1622107691213_1054_m_000000_0 | JobImpl.java:1401
Line 58030: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
Line 58035: 2021-06-02 16:26:34,034 | INFO | RMCommunicator Allocator | Killing taskAttempt:attempt_1622107691213_1054_m_000000_0 because it is running on unusable node:node-group-1ZYEq0002:26009 | RMContainerAllocator.java:1066
Line 58043: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
Line 58054: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
Line 58055: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_000000_0: Container released on a *lost* node | TaskAttemptImpl.java:2649
Line 58057: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
Line 60317: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | Too many fetch-failures for output of task attempt: attempt_1622107691213_1054_m_000000_0 ... raising fetch failure to map | JobImpl.java:2005
Line 60319: 2021-06-02 16:26:57,057 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_TOO_MANY_FETCH_FAILURE | TaskAttemptImpl.java:1390
Line 60320: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | attempt_1622107691213_1054_m_000000_0 transitioned from state SUCCESS_CONTAINER_CLEANUP to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=node-group-1ZYEq0002:26009 | TaskAttemptImpl.java:1411
Line 69487: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
Line 69527: 2021-06-02 16:30:02,002 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_000000_0: cleanup failed for container container_e03_1622107691213_1054_01_000005 : java.net.ConnectException: Call From node-group-1ZYEq0001/192.168.0.66 to node-group-1ZYEq0002:26009 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Line 69607: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
Line 69609: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
Line 73645: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | Fetcher 9 going to fetch from node-group-1ZYEq0002:26008 for: [attempt_1622107691213_1054_m_000000_0] | Fetcher.java:318
Line 73646: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | MapOutput URL for node-group-1ZYEq0002:26008 -> http://node-group-1ZYEq0002:26008/mapOutput?job=job_1622107691213_1054&reduce=4&map=attempt_1622107691213_1054_m_000000_0 | Fetcher.java:686
Line 74093: 2021-06-02 16:26:56,056 | INFO | fetcher#9 | Reporting fetch failure for attempt_1622107691213_1054_m_000000_0 to MRAppMaster. | ShuffleSchedulerImpl.java:349
{code}

The logs show that the RM notified the AM of the node update at 16:26:34, but the resulting TA_KILL event was ignored because TaskAttemptImpl was in the SUCCESS_CONTAINER_CLEANUP state. The next event received was TA_TOO_MANY_FETCH_FAILURE (16:26:57), which moved the attempt from SUCCESS_CONTAINER_CLEANUP to FAILED and caused the task to fail.
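
The event ordering described above can be replayed with a small toy model. This is only a sketch, not Hadoop's TaskAttemptImpl: the state and event names are taken from the log excerpt, while the class `AttemptStateSketch`, the method `replay`, the `honourKill` flag and the KILLED target state are hypothetical names introduced for illustration. The `honourKill = true` branch is an assumed alternative shown for contrast and is not claimed to be the fix for this issue.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Toy model only: names come from the log excerpt, the transition table is a
// simplification and does not reproduce Hadoop's TaskAttemptImpl.
class AttemptStateSketch {
    enum State { SUCCESS_CONTAINER_CLEANUP, SUCCEEDED, KILLED, FAILED }
    enum Event { TA_KILL, TA_TOO_MANY_FETCH_FAILURE, TA_CONTAINER_CLEANED }

    // Transitions out of SUCCESS_CONTAINER_CLEANUP in this toy model.
    // With honourKill == false there is no entry for TA_KILL, so it is ignored.
    static Map<Event, State> cleanupTransitions(boolean honourKill) {
        Map<Event, State> t = new EnumMap<>(Event.class);
        t.put(Event.TA_CONTAINER_CLEANED, State.SUCCEEDED);
        t.put(Event.TA_TOO_MANY_FETCH_FAILURE, State.FAILED);
        if (honourKill) {
            t.put(Event.TA_KILL, State.KILLED); // hypothetical, for contrast only
        }
        return t;
    }

    // Feeds events to an attempt that starts in SUCCESS_CONTAINER_CLEANUP and
    // returns the state it ends in. Events without a transition leave the
    // state unchanged; any other state is treated as terminal here.
    static State replay(List<Event> events, boolean honourKill) {
        State state = State.SUCCESS_CONTAINER_CLEANUP;
        Map<Event, State> transitions = cleanupTransitions(honourKill);
        for (Event e : events) {
            if (state != State.SUCCESS_CONTAINER_CLEANUP) {
                break;
            }
            state = transitions.getOrDefault(e, state);
        }
        return state;
    }

    public static void main(String[] args) {
        // Order seen in the AM log: TA_KILL at 16:26:34, fetch failure at 16:26:57.
        List<Event> observed = Arrays.asList(Event.TA_KILL, Event.TA_TOO_MANY_FETCH_FAILURE);
        System.out.println("kill ignored: " + replay(observed, false));
        System.out.println("kill honoured (hypothetical): " + replay(observed, true));
    }
}
```

In this model the ignored TA_KILL leaves the attempt in SUCCESS_CONTAINER_CLEANUP, so the later TA_TOO_MANY_FETCH_FAILURE is what decides the final state.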