Date: Fri, 11 Jun 2021 06:53:00 +0000 (UTC)
From: "luhuachao (Jira)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Updated] (MAPREDUCE-7240) Exception ' Invalid event: TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER' cause job error

     [ https://issues.apache.org/jira/browse/MAPREDUCE-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

luhuachao updated MAPREDUCE-7240:
---------------------------------
    Attachment:     (was: application_1566552310686_260041.log)

> Exception ' Invalid event: TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER' cause job error
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7240
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7240
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.8.2
>            Reporter: luhuachao
>            Assignee: luhuachao
>            Priority: Critical
>              Labels: Reviewed, applicationmaster, mrv2
>             Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
>         Attachments: MAPREDUCE-7240-001.patch, MAPREDUCE-7240-002.patch, MAPREDUCE-7240-branch-3.1.001.patch, MAPREDUCE-7240-branch-3.2.001.patch, MAPREDUCE-7240-branch-3.2.001.patch
>
>
> *Log in the AppMaster*
> {noformat}
> 2019-09-03 17:18:43,090 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures for output of task attempt: attempt_1566552310686_260041_m_000052_0 ... raising fetch failure to map
> 2019-09-03 17:18:43,091 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures for output of task attempt: attempt_1566552310686_260041_m_000049_0 ... raising fetch failure to map
> 2019-09-03 17:18:43,091 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures for output of task attempt: attempt_1566552310686_260041_m_000051_0 ... raising fetch failure to map
> 2019-09-03 17:18:43,091 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures for output of task attempt: attempt_1566552310686_260041_m_000050_0 ... raising fetch failure to map
> 2019-09-03 17:18:43,091 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Too many fetch-failures for output of task attempt: attempt_1566552310686_260041_m_000053_0 ... raising fetch failure to map
> 2019-09-03 17:18:43,092 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1566552310686_260041_m_000052_0 transitioned from state SUCCEEDED to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=yarn095:45454
> 2019-09-03 17:18:43,092 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Can't handle this event at current state for attempt_1566552310686_260041_m_000049_0
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> 	at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:1206)
> 	at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:146)
> 	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1458)
> 	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1450)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> 	at java.lang.Thread.run(Thread.java:745)
> 2019-09-03 17:18:43,093 ERROR [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Can't handle this event at current state for attempt_1566552310686_260041_m_000051_0
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> 	at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:1206)
> 	at org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:146)
> 	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1458)
> 	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$TaskAttemptEventDispatcher.handle(MRAppMaster.java:1450)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> 	at java.lang.Thread.run(Thread.java:745)
> 2019-09-03 17:18:43,093 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1566552310686_260041_m_000050_0 transitioned from state SUCCEEDED to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=yarn095:45454
> 2019-09-03 17:18:43,093 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1566552310686_260041_m_000053_0 transitioned from state SUCCEEDED to FAILED, event type is TA_TOO_MANY_FETCH_FAILURE and nodeId=yarn095:45454
> 2019-09-03 17:18:43,094 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: task_1566552310686_260041_m_000052 Task Transitioned from SUCCEEDED to SCHEDULED
> 2019-09-03 17:18:43,096 FATAL [IPC Server handler 27 on 35972] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1566552310686_260041_r_000005_0 - exited : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#22
> 	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1961)
> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 	at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:367)
> 	at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:289)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:355)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
> 2019-09-03 17:18:43,096 INFO [IPC Server handler 27 on 35972] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1566552310686_260041_r_000005_0: Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#22
> 	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
> 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1961)
> 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 	at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:367)
> 	at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:289)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:355)
> 	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
> 2019-09-03 17:18:43,097 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1566552310686_260041Job Transitioned from RUNNING to ERROR
> 2019-09-03 17:18:43,099 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job
> {noformat}
>
> The NodeManager's log looks the same as the log in MAPREDUCE-6869.
> The code in TaskAttemptImpl indicates that 'Invalid event: TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER' is what moves the job into the ERROR state. What confuses me is:
> # What causes the AppMaster to handle the TA_TOO_MANY_FETCH_FAILURE event for an attempt in SUCCESS_FINISHING_CONTAINER, where it is an illegal event? Other attempts transitioned successfully from SUCCEEDED to FAILED on the same TA_TOO_MANY_FETCH_FAILURE event.
> # Restarting the NodeManager clears the error in the NM, and the shuffle error goes away too. What causes this?
> Correct me if I am wrong.
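
The difference between the attempts that went from SUCCEEDED to FAILED and the two that raised InvalidStateTransitionException can be pictured with a toy transition table. This is only a sketch under stated assumptions, not Hadoop's StateMachineFactory or TaskAttemptImpl code: the class AttemptStateSketch and its handle method are hypothetical, and the table holds just the states and the one event that appear in the log above. It registers the SUCCEEDED row and deliberately leaves SUCCESS_FINISHING_CONTAINER without one, so the same event is accepted in one state and rejected in the other.

```java
import java.util.EnumMap;
import java.util.Map;

public class AttemptStateSketch {
    // Only the states and the event seen in the log; the real state machine has many more.
    enum State { SUCCESS_FINISHING_CONTAINER, SUCCEEDED, FAILED }
    enum Event { TA_TOO_MANY_FETCH_FAILURE }

    // Transition table: (current state, event) -> next state.
    private static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);
    static {
        Map<Event, State> fromSucceeded = new EnumMap<>(Event.class);
        fromSucceeded.put(Event.TA_TOO_MANY_FETCH_FAILURE, State.FAILED);
        TABLE.put(State.SUCCEEDED, fromSucceeded);
        // Assumption for illustration: no row is registered for SUCCESS_FINISHING_CONTAINER.
    }

    // Returns the next state, or throws if the pair is not in the table.
    static State handle(State current, Event event) {
        Map<Event, State> row = TABLE.get(current);
        if (row == null || !row.containsKey(event)) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        return row.get(event);
    }

    public static void main(String[] args) {
        // Attempt already in SUCCEEDED: the event is accepted.
        System.out.println(handle(State.SUCCEEDED, Event.TA_TOO_MANY_FETCH_FAILURE));
        // Attempt still in SUCCESS_FINISHING_CONTAINER: the same event is rejected.
        try {
            handle(State.SUCCESS_FINISHING_CONTAINER, Event.TA_TOO_MANY_FETCH_FAILURE);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the outcome depends only on which state the attempt is in when the event arrives, which is consistent with attempts 000052, 000050 and 000053 failing cleanly while 000049 and 000051 hit the exception.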