hadoop-mapreduce-issues mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5891) Improved shuffle error handling across NM restarts
Date Tue, 02 Sep 2014 16:00:23 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118290#comment-14118290 ]

Ming Ma commented on MAPREDUCE-5891:

Thanks, Junping and Jason, for the useful patch.

When slowstart is set to a small value, the reducer will fetch some mapper output
and then wait for the rest. Is it possible that Fetcher.retryStartTime is left at an old value
from an earlier restart of NM host A, so that the fetcher retry is marked as timed out when it
later tries to handle a restart of NM host B?
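
To make the retryStartTime concern concrete, here is a minimal sketch of one way to avoid a
stale start time: keep the retry window per host and clear it after a successful fetch. This
is not the code from the patch. The class PerHostRetryWindow, its methods, and the explicit
nowMs parameter are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch, NOT the Fetcher code from MAPREDUCE-5891.
 * Tracks when retries began for each host, so an old outage on one
 * host cannot shorten the retry window for a later outage on another.
 */
final class PerHostRetryWindow {
    private final long timeoutMs;
    // Start of the current retry window per host; absent while the host is healthy.
    private final Map<String, Long> retryStartMs = new HashMap<>();

    PerHostRetryWindow(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Records a failed connection; returns true while the window is still open. */
    boolean shouldRetry(String host, long nowMs) {
        // putIfAbsent returns null when this is the first failure of a new window.
        Long previous = retryStartMs.putIfAbsent(host, nowMs);
        long windowStart = (previous == null) ? nowMs : previous;
        return nowMs - windowStart <= timeoutMs;
    }

    /** Clears the window after a successful fetch so the next outage starts fresh. */
    void onSuccess(String host) {
        retryStartMs.remove(host);
    }
}
```

With a single start time that is never reset, a failure on host B long after host A recovered
would be measured against host A's old timestamp and could time out immediately. Keying the
window by host, or resetting it on success, is one possible way around that.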

To make sure the fetcher doesn't retry unnecessarily in the decommission scenario, the
assumption seems to be that we will have some sort of graceful decommission support, so that
the fetcher can still get mapper output during the decommission process. Is that true?

If we get time to do YARN-1593, that will further reduce the chance of a shuffle handler restart.
Any opinion on that?

> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>                 Key: MAPREDUCE-5891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5891
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.5.0
>            Reporter: Jason Lowe
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-5891-demo.patch, MAPREDUCE-5891-v2.patch, MAPREDUCE-5891-v3.patch,
> To minimize the number of map fetch failures reported by reducers across an NM restart,
> it would be nice if reducers only reported a fetch failure after trying for a specified period
> of time to retrieve the data.

