hadoop-mapreduce-issues mailing list archives

From "gaoyu (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (MAPREDUCE-7349) An unexpected node crash and delayed messages would fail the job
Date Thu, 03 Jun 2021 08:29:00 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-7349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17356284#comment-17356284 ]

gaoyu edited comment on MAPREDUCE-7349 at 6/3/21, 8:28 AM:
-----------------------------------------------------------

i


was (Author: gy_way):
Related cluster configuration:
 * MAX_FETCH_FAILURES_NOTIFICATIONS is 3
 * NodeManager recovery is disabled

Bug scenario:
 # submit a wordcount job that contains 2 simple map tasks ({{map_0}} and {{map_1}}) and 1 simple reduce task ({{reduce_0}});
 # all map tasks finish successfully and the AppMaster is notified;
 # the NodeManager that ran the map task {{map_1}} crashes;
 # the AppMaster schedules a reduce attempt;
 # the reduce attempt sends a {{statusUpdate}} message to the AppMaster to report a fetch failure;
 # the reduce attempt fails with {{Shuffle$ShuffleError}}, caused by {{java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out}};
 # the reduce attempt sends a {{fatalError}} message to the AppMaster;
 # the AppMaster reschedules three more reduce attempts in succession, but all of them fail with {{Shuffle$ShuffleError}};
 # the AppMaster fails the wordcount job because of the failed reduce task;
 # the AppMaster receives three {{statusUpdate}} messages reporting a fetch failure, like the message in step 5, but it has already failed the job and does not rerun the task {{map_1}}.
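
The ordering problem in the scenario above can be illustrated with a small toy model. This is only a sketch and not the actual AppMaster code: the class {{FetchFailureRace}}, the {{Event}} enum, the {{run}} method, and the limit of four reduce attempts (one original plus the three reschedules from step 8) are assumptions made for illustration. It also assumes that all fetch-failure reports arrive after the attempt failures, and it ignores any other conditions the real AppMaster may check before rerunning a map. Only the value 3 for MAX_FETCH_FAILURES_NOTIFICATIONS comes from the configuration listed above.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the race described above; all names here are hypothetical.
final class FetchFailureRace {
    // Value taken from the cluster configuration in the report.
    static final int MAX_FETCH_FAILURES_NOTIFICATIONS = 3;
    // Assumption: one original reduce attempt plus three reschedules (step 8).
    static final int MAX_REDUCE_ATTEMPTS = 4;

    enum Event { FETCH_FAILURE_REPORT, REDUCE_ATTEMPT_FAILED }

    // Processes events in arrival order and returns the first decisive outcome.
    static String run(List<Event> events) {
        int fetchFailureReports = 0;
        int failedReduceAttempts = 0;
        for (Event e : events) {
            if (e == Event.FETCH_FAILURE_REPORT) {
                fetchFailureReports++;
                // Enough reports: the map output is treated as lost and the map is rerun.
                if (fetchFailureReports >= MAX_FETCH_FAILURES_NOTIFICATIONS) {
                    return "MAP_RERUN";
                }
            } else {
                failedReduceAttempts++;
                // Reduce task ran out of attempts: the job fails; later reports are ignored.
                if (failedReduceAttempts >= MAX_REDUCE_ATTEMPTS) {
                    return "JOB_FAILED";
                }
            }
        }
        return "RUNNING";
    }

    // Order from the bug scenario: four attempt failures, then the delayed reports.
    static List<Event> delayedOrder() {
        return Arrays.asList(
            Event.REDUCE_ATTEMPT_FAILED, Event.REDUCE_ATTEMPT_FAILED,
            Event.REDUCE_ATTEMPT_FAILED, Event.REDUCE_ATTEMPT_FAILED,
            Event.FETCH_FAILURE_REPORT, Event.FETCH_FAILURE_REPORT,
            Event.FETCH_FAILURE_REPORT);
    }

    // Hypothetical timely order: the third report arrives before attempts run out.
    static List<Event> timelyOrder() {
        return Arrays.asList(
            Event.FETCH_FAILURE_REPORT, Event.REDUCE_ATTEMPT_FAILED,
            Event.FETCH_FAILURE_REPORT, Event.REDUCE_ATTEMPT_FAILED,
            Event.FETCH_FAILURE_REPORT);
    }

    public static void main(String[] args) {
        System.out.println("delayed: " + run(delayedOrder()));
        System.out.println("timely:  " + run(timelyOrder()));
    }
}
```

In this model the delayed order ends in {{JOB_FAILED}} while the timely order ends in {{MAP_RERUN}}, which matches the observation in step 10 that the reports arrive too late to trigger a rerun of {{map_1}}.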
  
  

> An unexpected node crash and delayed messages would fail the job
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-7349
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7349
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.2.2
>            Reporter: gaoyu
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org

