Date: Wed, 16 Jun 2021 07:14:00 +0000 (UTC)
From: "Bilwa S T (Jira)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: <JIRA.13384081.1623827630000.608546.1623827640406@Atlassian.JIRA>
In-Reply-To: <JIRA.13384081.1623827630000@Atlassian.JIRA>
References: <JIRA.13384081.1623827630000@Atlassian.JIRA> <JIRA.13384081.1623827630227@jira2-he-de.apache.org>
Subject: [jira] [Created] (MAPREDUCE-7353) Mapreduce job fails when NM is
 stopped

Bilwa S T created MAPREDUCE-7353:
------------------------------------

             Summary: Mapreduce job fails when NM is stopped
                 Key: MAPREDUCE-7353
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Bilwa S T
            Assignee: Bilwa S T


The job fails because a map task fails due to too many fetch failures.
{code:java}
Line 48048: 2021-06-02 16:25:02,002 | INFO  | ContainerLauncher #6 | Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e03_1622107691213_1054_01_000005 taskAttempt attempt_1622107691213_1054_m_000000_0 | ContainerLauncherImpl.java:394
Line 48053: 2021-06-02 16:25:02,002 | INFO  | ContainerLauncher #6 | KILLING attempt_1622107691213_1054_m_000000_0 | ContainerLauncherImpl.java:209
Line 58026: 2021-06-02 16:26:34,034 | INFO  | AsyncDispatcher event handler | TaskAttempt killed because it ran on unusable node node-group-1ZYEq0002:26009. AttemptId:attempt_1622107691213_1054_m_000000_0 | JobImpl.java:1401
Line 58030: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
Line 58035: 2021-06-02 16:26:34,034 | INFO  | RMCommunicator Allocator | Killing taskAttempt:attempt_1622107691213_1054_m_000000_0 because it is running on unusable node:node-group-1ZYEq0002:26009 | RMContainerAllocator.java:1066
Line 58043: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
Line 58054: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
Line 58055: 2021-06-02 16:26:34,034 | INFO  | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_000000_0: Container released on a *lost* node | TaskAttemptImpl.java:2649
Line 58057: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
Line 60317: 2021-06-02 16:26:57,057 | INFO  | AsyncDispatcher event handler | Too many fetch-failures for output of task attempt: attempt_1622107691213_1054_m_000000_0 ... raising fetch failure to map | JobImpl.java:2005
Line 60319: 2021-06-02 16:26:57,057 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_TOO_MANY_FETCH_FAILURE | TaskAttemptImpl.java:1390
Line 60320: 2021-06-02 16:26:57,057 | INFO  | AsyncDispatcher event handler | attempt_1622107691213_1054_m_000000_0 transitioned from state SUCCESS_CONTAINER_CLEANUP to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=node-group-1ZYEq0002:26009 | TaskAttemptImpl.java:1411
Line 69487: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
Line 69527: 2021-06-02 16:30:02,002 | INFO  | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_000000_0: cleanup failed for container container_e03_1622107691213_1054_01_000005 : java.net.ConnectException: Call From node-group-1ZYEq0001/192.168.0.66 to node-group-1ZYEq0002:26009 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
Line 69607: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
Line 69609: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
Line 73645: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | Fetcher 9 going to fetch from node-group-1ZYEq0002:26008 for: [attempt_1622107691213_1054_m_000000_0] | Fetcher.java:318
Line 73646: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | MapOutput URL for node-group-1ZYEq0002:26008 -> http://node-group-1ZYEq0002:26008/mapOutput?job=job_1622107691213_1054&reduce=4&map=attempt_1622107691213_1054_m_000000_0 | Fetcher.java:686
Line 74093: 2021-06-02 16:26:56,056 | INFO  | fetcher#9 | Reporting fetch failure for attempt_1622107691213_1054_m_000000_0 to MRAppMaster. | ShuffleSchedulerImpl.java:349
{code}

As the logs show, the RM notified the AM of the node update at 16:26:34, but the resulting event was skipped because a KILL event (TA_KILL) is ignored while TaskAttemptImpl is in the SUCCESS_CONTAINER_CLEANUP state. The next event received is TA_TOO_MANY_FETCH_FAILURE, which moves the attempt from SUCCESS_CONTAINER_CLEANUP to FAILED and causes the task to fail.
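
The sequence described above can be sketched as a small state machine. This is only an illustrative model and not Hadoop's TaskAttemptImpl code: the class AttemptStateSketch, its State and Event enums, and the next/replay helpers are hypothetical names, and the transition to SUCCEEDED on TA_CONTAINER_CLEANED is an assumption about the normal path. The event order is taken from the AM log entries between 16:26:34 and 16:30:02.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: a simplified assumption based on the log in this report,
// not the real transition table of Hadoop's TaskAttemptImpl.
class AttemptStateSketch {

    enum State { SUCCESS_CONTAINER_CLEANUP, SUCCEEDED, FAILED }

    enum Event {
        TA_KILL,
        TA_DIAGNOSTICS_UPDATE,
        TA_TOO_MANY_FETCH_FAILURE,
        TA_CONTAINER_CLEANED
    }

    // Returns the state after applying one event to the current state.
    static State next(State current, Event event) {
        if (current != State.SUCCESS_CONTAINER_CLEANUP) {
            return current; // SUCCEEDED and FAILED are terminal in this toy model
        }
        switch (event) {
            case TA_KILL:
                return current;          // reported behavior: KILL is ignored here
            case TA_TOO_MANY_FETCH_FAILURE:
                return State.FAILED;     // matches "transitioned ... to FAILED" in the log
            case TA_CONTAINER_CLEANED:
                return State.SUCCEEDED;  // assumed normal path, not shown in the log
            default:
                return current;          // diagnostics updates do not change state
        }
    }

    // Applies the events in order, starting from SUCCESS_CONTAINER_CLEANUP.
    static State replay(List<Event> events) {
        State state = State.SUCCESS_CONTAINER_CLEANUP;
        for (Event e : events) {
            state = next(state, e);
        }
        return state;
    }

    public static void main(String[] args) {
        // Event order as it appears in the AM log for attempt ..._m_000000_0.
        List<Event> observed = Arrays.asList(
            Event.TA_KILL,
            Event.TA_KILL,
            Event.TA_DIAGNOSTICS_UPDATE,
            Event.TA_KILL,
            Event.TA_TOO_MANY_FETCH_FAILURE,
            Event.TA_DIAGNOSTICS_UPDATE,
            Event.TA_CONTAINER_CLEANED,
            Event.TA_CONTAINER_CLEANED);
        System.out.println(replay(observed)); // prints FAILED
    }
}
```

In this model the three TA_KILL events leave the attempt in SUCCESS_CONTAINER_CLEANUP, so the later fetch-failure event is the first one that changes its state.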

