From mapreduce-issues-return-95509-archive-asf-public=cust-asf.ponee.io@hadoop.apache.org  Wed Jun 30 10:03:04 2021
Date: Wed, 30 Jun 2021 10:03:00 +0000 (UTC)
From: "Bilwa S T (Jira)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: <JIRA.13384081.1623827630000.672892.1625047380391@Atlassian.JIRA>
In-Reply-To: <JIRA.13384081.1623827630000@Atlassian.JIRA>
References: <JIRA.13384081.1623827630000@Atlassian.JIRA> <JIRA.13384081.1623827630227@jira2-he-de.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is
 stopped
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371932#comment-17371932 ]

Bilwa S T commented on MAPREDUCE-7353:
--------------------------------------

Hi [~epayne], can you please check the updated patch? Thanks.

> Mapreduce job fails when NM is stopped
> --------------------------------------
>
>                 Key: MAPREDUCE-7353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Bilwa S T
>            Assignee: Bilwa S T
>            Priority: Major
>         Attachments: MAPREDUCE-7353.001.patch, MAPREDUCE-7353.002.patch
>
>
> The job fails because a task fails due to too many fetch failures.
> {code:java}
> Line 48048: 2021-06-02 16:25:02,002 | INFO  | ContainerLauncher #6 | Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e03_1622107691213_1054_01_000005 taskAttempt attempt_1622107691213_1054_m_000000_0 | ContainerLauncherImpl.java:394
> Line 48053: 2021-06-02 16:25:02,002 | INFO  | ContainerLauncher #6 | KILLING attempt_1622107691213_1054_m_000000_0 | ContainerLauncherImpl.java:209
> Line 58026: 2021-06-02 16:26:34,034 | INFO  | AsyncDispatcher event handler | TaskAttempt killed because it ran on unusable node node-group-1ZYEq0002:26009. AttemptId:attempt_1622107691213_1054_m_000000_0 | JobImpl.java:1401
> Line 58030: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 58035: 2021-06-02 16:26:34,034 | INFO  | RMCommunicator Allocator | Killing taskAttempt:attempt_1622107691213_1054_m_000000_0 because it is running on unusable node:node-group-1ZYEq0002:26009 | RMContainerAllocator.java:1066
> Line 58043: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 58054: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
> Line 58055: 2021-06-02 16:26:34,034 | INFO  | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_000000_0: Container released on a *lost* node | TaskAttemptImpl.java:2649
> Line 58057: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 60317: 2021-06-02 16:26:57,057 | INFO  | AsyncDispatcher event handler | Too many fetch-failures for output of task attempt: attempt_1622107691213_1054_m_000000_0 ... raising fetch failure to map | JobImpl.java:2005
> Line 60319: 2021-06-02 16:26:57,057 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_TOO_MANY_FETCH_FAILURE | TaskAttemptImpl.java:1390
> Line 60320: 2021-06-02 16:26:57,057 | INFO  | AsyncDispatcher event handler | attempt_1622107691213_1054_m_000000_0 transitioned from state SUCCESS_CONTAINER_CLEANUP to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=node-group-1ZYEq0002:26009 | TaskAttemptImpl.java:1411
> Line 69487: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
> Line 69527: 2021-06-02 16:30:02,002 | INFO  | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_000000_0: cleanup failed for container container_e03_1622107691213_1054_01_000005 : java.net.ConnectException: Call From node-group-1ZYEq0001/192.168.0.66 to node-group-1ZYEq0002:26009 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> Line 69607: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
> Line 69609: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
> Line 73645: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | Fetcher 9 going to fetch from node-group-1ZYEq0002:26008 for: [attempt_1622107691213_1054_m_000000_0] | Fetcher.java:318
> Line 73646: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | MapOutput URL for node-group-1ZYEq0002:26008 -> http://node-group-1ZYEq0002:26008/mapOutput?job=job_1622107691213_1054&reduce=4&map=attempt_1622107691213_1054_m_000000_0 | Fetcher.java:686
> Line 74093: 2021-06-02 16:26:56,056 | INFO  | fetcher#9 | Reporting fetch failure for attempt_1622107691213_1054_m_000000_0 to MRAppMaster. | ShuffleSchedulerImpl.java:349
> {code}
> As the logs show, the RM notified the AM of the node update at 16:26:34, but the resulting TA_KILL event was skipped because KILL events are ignored while TaskAttemptImpl is in the SUCCESS_CONTAINER_CLEANUP state. The next event received is TA_TOO_MANY_FETCH_FAILURE, which moves the task attempt to FAILED.
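
The event ordering described in the report can be replayed with a toy state machine. This is only a minimal sketch under stated assumptions, not Hadoop's TaskAttemptImpl: the class FetchFailureRaceSketch, its transition() and replay() helpers, and the reduced set of states and events are all hypothetical names introduced for illustration. The TA_CONTAINER_CLEANED -> SUCCEEDED transition is an assumption about the normal exit from cleanup; the other transitions follow the log excerpt.

```java
import java.util.Arrays;
import java.util.List;

public class FetchFailureRaceSketch {
    // Reduced set of states and events; names taken from the log excerpt.
    enum State { SUCCESS_CONTAINER_CLEANUP, SUCCEEDED, FAILED }
    enum Event { TA_KILL, TA_DIAGNOSTICS_UPDATE, TA_TOO_MANY_FETCH_FAILURE, TA_CONTAINER_CLEANED }

    // Hypothetical transition function modelling only the reported behavior.
    static State transition(State current, Event event) {
        if (current != State.SUCCESS_CONTAINER_CLEANUP) {
            return current; // other states are out of scope for this sketch
        }
        switch (event) {
            case TA_TOO_MANY_FETCH_FAILURE:
                return State.FAILED;    // matches the transition logged at 16:26:57
            case TA_CONTAINER_CLEANED:
                return State.SUCCEEDED; // assumed normal exit from cleanup
            default:
                return current;         // TA_KILL and TA_DIAGNOSTICS_UPDATE leave the state unchanged
        }
    }

    // Applies the events in order and returns the final state.
    static State replay(State start, List<Event> events) {
        State state = start;
        for (Event event : events) {
            state = transition(state, event);
        }
        return state;
    }

    public static void main(String[] args) {
        // Event order as it appears in the AM log between 16:26:34 and 16:30:02.
        List<Event> observed = Arrays.asList(
            Event.TA_KILL, Event.TA_KILL, Event.TA_DIAGNOSTICS_UPDATE, Event.TA_KILL,
            Event.TA_TOO_MANY_FETCH_FAILURE, Event.TA_DIAGNOSTICS_UPDATE,
            Event.TA_CONTAINER_CLEANED, Event.TA_CONTAINER_CLEANED);
        System.out.println(replay(State.SUCCESS_CONTAINER_CLEANUP, observed)); // prints FAILED
    }
}
```

In this model the three TA_KILL events have no effect, so the attempt is still in SUCCESS_CONTAINER_CLEANUP when the fetch failure arrives, and the late TA_CONTAINER_CLEANED events can no longer change the outcome.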

