From: "Bilwa S T (Jira)"
To: mapreduce-issues@hadoop.apache.org
Date: Wed, 30 Jun 2021 10:03:00 +0000 (UTC)
Subject: [jira] [Commented] (MAPREDUCE-7353) Mapreduce job fails when NM is stopped

[
https://issues.apache.org/jira/browse/MAPREDUCE-7353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371932#comment-17371932 ]

Bilwa S T commented on MAPREDUCE-7353:
--------------------------------------

Hi [~epayne], can you please check the updated patch? Thanks

> Mapreduce job fails when NM is stopped
> --------------------------------------
>
>                 Key: MAPREDUCE-7353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Bilwa S T
>            Assignee: Bilwa S T
>            Priority: Major
>         Attachments: MAPREDUCE-7353.001.patch, MAPREDUCE-7353.002.patch
>
>
> The job fails because a task fails due to too many fetch failures.
> {code:java}
> Line 48048: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container container_e03_1622107691213_1054_01_000005 taskAttempt attempt_1622107691213_1054_m_000000_0 | ContainerLauncherImpl.java:394
> Line 48053: 2021-06-02 16:25:02,002 | INFO | ContainerLauncher #6 | KILLING attempt_1622107691213_1054_m_000000_0 | ContainerLauncherImpl.java:209
> Line 58026: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | TaskAttempt killed because it ran on unusable node node-group-1ZYEq0002:26009. AttemptId:attempt_1622107691213_1054_m_000000_0 | JobImpl.java:1401
> Line 58030: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 58035: 2021-06-02 16:26:34,034 | INFO | RMCommunicator Allocator | Killing taskAttempt:attempt_1622107691213_1054_m_000000_0 because it is running on unusable node:node-group-1ZYEq0002:26009 | RMContainerAllocator.java:1066
> Line 58043: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 58054: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
> Line 58055: 2021-06-02 16:26:34,034 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_000000_0: Container released on a *lost* node | TaskAttemptImpl.java:2649
> Line 58057: 2021-06-02 16:26:34,034 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_KILL | TaskAttemptImpl.java:1390
> Line 60317: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | Too many fetch-failures for output of task attempt: attempt_1622107691213_1054_m_000000_0 ... raising fetch failure to map | JobImpl.java:2005
> Line 60319: 2021-06-02 16:26:57,057 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_TOO_MANY_FETCH_FAILURE | TaskAttemptImpl.java:1390
> Line 60320: 2021-06-02 16:26:57,057 | INFO | AsyncDispatcher event handler | attempt_1622107691213_1054_m_000000_0 transitioned from state SUCCESS_CONTAINER_CLEANUP to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=node-group-1ZYEq0002:26009 | TaskAttemptImpl.java:1411
> Line 69487: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_DIAGNOSTICS_UPDATE | TaskAttemptImpl.java:1390
> Line 69527: 2021-06-02 16:30:02,002 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1622107691213_1054_m_000000_0: cleanup failed for container container_e03_1622107691213_1054_01_000005 : java.net.ConnectException: Call From node-group-1ZYEq0001/192.168.0.66 to node-group-1ZYEq0002:26009 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> Line 69607: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
> Line 69609: 2021-06-02 16:30:02,002 | DEBUG | AsyncDispatcher event handler | Processing attempt_1622107691213_1054_m_000000_0 of type TA_CONTAINER_CLEANED | TaskAttemptImpl.java:1390
> Line 73645: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | Fetcher 9 going to fetch from node-group-1ZYEq0002:26008 for: [attempt_1622107691213_1054_m_000000_0] | Fetcher.java:318
> Line 73646: 2021-06-02 16:23:56,056 | DEBUG | fetcher#9 | MapOutput URL for node-group-1ZYEq0002:26008 -> http://node-group-1ZYEq0002:26008/mapOutput?job=job_1622107691213_1054&reduce=4&map=attempt_1622107691213_1054_m_000000_0 | Fetcher.java:686
> Line 74093: 2021-06-02 16:26:56,056 | INFO | fetcher#9 | Reporting fetch failure for attempt_1622107691213_1054_m_000000_0 to MRAppMaster. | ShuffleSchedulerImpl.java:349
> {code}
> As the logs show, the RM notified the AM of the node update at 16:26:34, but the event was skipped because the KILL event is ignored while TaskAttemptImpl is in the SUCCESS_CONTAINER_CLEANUP state. The attempt then receives a TA_TOO_MANY_FETCH_FAILURE event, which causes the task to fail.
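The event ordering described in the issue can be replayed with a small state-machine sketch. This is a simplified, hypothetical model and not Hadoop's TaskAttemptImpl: the class name AttemptStateSketch and the methods next and replay are invented for illustration, and only the state and event names (SUCCESS_CONTAINER_CLEANUP, FAILED, TA_KILL, TA_TOO_MANY_FETCH_FAILURE) are taken from the logs above. It assumes, as the issue states, that TA_KILL is ignored in SUCCESS_CONTAINER_CLEANUP and that TA_TOO_MANY_FETCH_FAILURE moves the attempt to FAILED.

```java
import java.util.List;

// Simplified, hypothetical model of the behaviour described in the issue.
// Not Hadoop code: only the state and event names come from the logs.
class AttemptStateSketch {
    enum State { SUCCESS_CONTAINER_CLEANUP, FAILED }
    enum Event { TA_KILL, TA_TOO_MANY_FETCH_FAILURE }

    // Returns the next state for the two transitions the issue describes.
    static State next(State current, Event event) {
        if (current == State.SUCCESS_CONTAINER_CLEANUP) {
            switch (event) {
                case TA_KILL:
                    // Assumption from the issue text: the kill is ignored here.
                    return current;
                case TA_TOO_MANY_FETCH_FAILURE:
                    // Matches the logged transition to FAILED.
                    return State.FAILED;
            }
        }
        // Any other combination is left unchanged in this sketch.
        return current;
    }

    // Applies the events in order and returns the final state.
    static State replay(State start, List<Event> events) {
        State state = start;
        for (Event event : events) {
            state = next(state, event);
        }
        return state;
    }

    public static void main(String[] args) {
        // Three TA_KILL events at 16:26:34, then the fetch-failure event at 16:26:57.
        State end = replay(State.SUCCESS_CONTAINER_CLEANUP, List.of(
                Event.TA_KILL, Event.TA_KILL, Event.TA_KILL,
                Event.TA_TOO_MANY_FETCH_FAILURE));
        System.out.println(end); // prints FAILED
    }
}
```

In this model the three kill events leave the attempt in SUCCESS_CONTAINER_CLEANUP, so the later fetch-failure event is the first one that changes its state, which is the sequence the logs show.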