hadoop-mapreduce-issues mailing list archives

From "gaoyu (Jira)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-7349) An unexpected node crash and delayed messages would fail the job
Date Thu, 03 Jun 2021 08:29:00 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-7349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaoyu updated MAPREDUCE-7349:
-----------------------------
    Description: 
Related cluster configuration:
 * MAX_FETCH_FAILURES_NOTIFICATIONS is 3
 * NodeManager recovery is disabled

Bug scenario:
 # submit a wordcount job that contains 2 simple map tasks ({{map_0}} and {{map_1}}) and 1 simple reduce task ({{reduce_0}});
 # all map tasks finish successfully and the AppMaster is notified;
 # the NodeManager that ran the map task {{map_1}} crashes;
 # the AppMaster schedules a reduce attempt;
 # the reduce attempt sends a {{statusUpdate}} message to the AppMaster to report a fetch failure;
 # the reduce attempt fails with {{Shuffle$ShuffleError}}, caused by {{java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out}};
 # the reduce attempt sends a {{fatalError}} message to the AppMaster;
 # the AppMaster successively reschedules another three reduce attempts, but all of them fail with {{Shuffle$ShuffleError}};
 # the AppMaster fails the wordcount job because of the failed reduce task;
 # the AppMaster receives three more {{statusUpdate}} messages that report a fetch failure, like the message in step 5, but it has already failed the job and does not rerun the task {{map_1}}.
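
The ordering problem in the scenario above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual AppMaster code: {{ToyAppMaster}}, {{simulate}} and the {{MAX_REDUCE_ATTEMPTS}} constant are hypothetical names, the limit of four reduce attempts is inferred from "another three reduce attempts" in step 8, and the model assumes that the first {{statusUpdate}} arrives on time while the later three are delayed. It also assumes that once {{map_1}} is rerun, the next reduce attempt can fetch its output.

```java
public class FetchFailureRaceSketch {
    // From the report: three fetch-failure notifications are required.
    static final int MAX_FETCH_FAILURES_NOTIFICATIONS = 3;
    // Assumption: first reduce attempt plus three rescheduled ones (step 8).
    static final int MAX_REDUCE_ATTEMPTS = 4;

    /** Hypothetical, heavily simplified stand-in for the AppMaster's job state. */
    static final class ToyAppMaster {
        int fetchFailureReports;
        int failedReduceAttempts;
        boolean jobFailed;
        boolean mapRerunScheduled;

        // A statusUpdate that reports a fetch failure for map_1.
        void onFetchFailureStatusUpdate() {
            if (jobFailed) {
                return; // a job that has already failed ignores late messages
            }
            fetchFailureReports++;
            if (fetchFailureReports >= MAX_FETCH_FAILURES_NOTIFICATIONS) {
                mapRerunScheduled = true;
            }
        }

        // A fatalError sent by a reduce attempt that gave up on the shuffle.
        void onFatalError() {
            if (jobFailed) {
                return;
            }
            failedReduceAttempts++;
            if (failedReduceAttempts >= MAX_REDUCE_ATTEMPTS) {
                jobFailed = true;
            }
        }
    }

    /** Replays the scenario; later statusUpdate messages may be delayed. */
    static ToyAppMaster simulate(boolean delayLaterStatusUpdates) {
        ToyAppMaster am = new ToyAppMaster();
        int delayedStatusUpdates = 0;
        for (int attempt = 1; attempt <= MAX_REDUCE_ATTEMPTS; attempt++) {
            if (am.mapRerunScheduled) {
                break; // assumption: the rerun map output lets this attempt succeed
            }
            if (attempt > 1 && delayLaterStatusUpdates) {
                delayedStatusUpdates++; // message is still in flight
            } else {
                am.onFetchFailureStatusUpdate();
            }
            am.onFatalError();
        }
        // The delayed messages finally arrive (step 10).
        for (int i = 0; i < delayedStatusUpdates; i++) {
            am.onFetchFailureStatusUpdate();
        }
        return am;
    }

    public static void main(String[] args) {
        ToyAppMaster delayed = simulate(true);
        ToyAppMaster timely = simulate(false);
        System.out.println("delayed: jobFailed=" + delayed.jobFailed
                + ", mapRerunScheduled=" + delayed.mapRerunScheduled);
        System.out.println("timely:  jobFailed=" + timely.jobFailed
                + ", mapRerunScheduled=" + timely.mapRerunScheduled);
    }
}
```

In this toy model the delayed ordering ends with the job failed and no rerun of {{map_1}}, while the timely ordering reaches the notification threshold before the reduce attempts are exhausted.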
  
  

> An unexpected node crash and delayed messages would fail the job
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-7349
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7349
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.2.2
>            Reporter: gaoyu
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org

