Date: Wed, 16 Jun 2021 10:54:00 +0000 (UTC)
From: "Bilwa S T (Jira)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Updated] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

[
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bilwa S T updated MAPREDUCE-7353:
---------------------------------
    Attachment: MAPREDUCE-7353.001.patch

> Mapreduce job fails when NM is stopped
> --------------------------------------
>
>                 Key: MAPREDUCE-7353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Bilwa S T
>            Assignee: Bilwa S T
>            Priority: Major
>         Attachments: MAPREDUCE-7353.001.patch
>
>
> The job fails because a task fails due to too many fetch failures.
> {code:java}
> Line 48048: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e03_1622107691213_1054_01_000005 taskAttempt attempt_1622107691213_1054_m_000000_0 | ContainerLauncherImpl.java:394
> Line 48053: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | KILLING attempt_1622107691213_1054_m_000000_0 | ContainerLauncherImpl.java:209
> Line 58026: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | TaskAttempt killed because it ran on unusable node node-group-1ZYEq0002:26009. AttemptId:attempt_1622107691213_1054_m_000000_0 | JobImpl.java:1401
> Line 58030: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 58035: 2021-06-02 16:26:34,034 | INFO | RMCommunicator Allocator | Killing taskAttempt:attempt_1622107691213_1054_m_000000_0 because it is running on unusable node:node-group-1ZYEq0002:26009 | RMContainerAllocator.java:1066
> Line 58043: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 58054: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
> Line 58055: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_000000_0: Container released on a *lost* node | TaskAttemptImpl.java:2649
> Line 58057: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 60317: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | Too many fetch-failures for output of task attempt: attempt_1622107691213_1054_m_000000_0 ... raising fetch failure to map | JobImpl.java:2005
> Line 60319: 2021-06-02 16:26:57,057 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_TOO_MANY_FETCH_FAILURE | TaskAttemptImpl.java:1390
> Line 60320: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | attempt_1622107691213_1054_m_000000_0 transitioned from state SUCCESS_CONTAINER_CLEANUP to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=node-group-1ZYEq0002:26009 | TaskAttemptImpl.java:1411
> Line 69487: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
> Line 69527: 2021-06-02 16:30:02,002 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_000000_0: cleanup failed for container container_e03_1622107691213_1054_01_000005 : java.net.ConnectException: Call From node-group-1ZYEq0001/192.168.0.66 to node-group-1ZYEq0002:26009 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> Line 69607: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
> Line 69609: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
> Line 73645: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | Fetcher 9 going to fetch from node-group-1ZYEq0002:26008 for: [attempt_1622107691213_1054_m_000000_0] | Fetcher.java:318
> Line 73646: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | MapOutput URL for node-group-1ZYEq0002:26008 -> http://node-group-1ZYEq0002:26008/mapOutput?job=job_1622107691213_1054&reduce=4&map=attempt_1622107691213_1054_m_000000_0 | Fetcher.java:686
> Line 74093: 2021-06-02 16:26:56,056 | INFO | fetcher#9 | Reporting fetch failure for attempt_1622107691213_1054_m_000000_0 to MRAppMaster. | ShuffleSchedulerImpl.java:349
> {code}
> As the logs show, the RM notified the AM of the node update at 16:26:34, but the resulting TA_KILL event was skipped because KILL events are ignored while TaskAttemptImpl is in the SUCCESS_CONTAINER_CLEANUP state. The attempt then receives a TA_TOO_MANY_FETCH_FAILURE event, which moves it from SUCCESS_CONTAINER_CLEANUP to FAILED and causes the task to fail.
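
The event sequence described above can be illustrated with a toy state machine. This is only a minimal sketch, assuming nothing beyond the state and event names visible in the log excerpt: the class TaskAttemptSketch and its methods next and replay are hypothetical, and the code does not reproduce Hadoop's real TaskAttemptImpl transition table or the contents of MAPREDUCE-7353.001.patch.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the behaviour seen in the log; NOT Hadoop's TaskAttemptImpl. */
final class TaskAttemptSketch {

    // Only the states and events that appear in the reported sequence.
    enum State { SUCCESS_CONTAINER_CLEANUP, FAILED }
    enum Event { TA_KILL, TA_DIAGNOSTICS_UPDATE, TA_TOO_MANY_FETCH_FAILURE }

    /** Returns the state after one event, mirroring only what the log shows. */
    static State next(State current, Event event) {
        if (current == State.SUCCESS_CONTAINER_CLEANUP) {
            switch (event) {
                case TA_KILL:
                    // Reported behaviour: the kill is ignored in this state.
                    return current;
                case TA_TOO_MANY_FETCH_FAILURE:
                    // Reported behaviour: the attempt moves to FAILED.
                    return State.FAILED;
                default:
                    // Diagnostics updates do not change the state.
                    return current;
            }
        }
        // Assumption for this sketch: FAILED absorbs every later event.
        return current;
    }

    /** Replays events starting from SUCCESS_CONTAINER_CLEANUP. */
    static State replay(List<Event> events) {
        State state = State.SUCCESS_CONTAINER_CLEANUP;
        for (Event event : events) {
            state = next(state, event);
        }
        return state;
    }

    public static void main(String[] args) {
        // Same order as the AsyncDispatcher entries between 16:26:34 and 16:26:57.
        List<Event> fromLog = Arrays.asList(
                Event.TA_KILL,
                Event.TA_KILL,
                Event.TA_DIAGNOSTICS_UPDATE,
                Event.TA_KILL,
                Event.TA_TOO_MANY_FETCH_FAILURE);
        System.out.println(replay(fromLog)); // prints FAILED
    }
}
```

In this sketch the three TA_KILL events leave the attempt in SUCCESS_CONTAINER_CLEANUP, so the later fetch-failure event is the one that decides the outcome, which matches the transition logged at TaskAttemptImpl.java:1411.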