hadoop-mapreduce-issues mailing list archives

From "Varun Saxena (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time
Date Fri, 08 Apr 2016 19:37:25 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232778#comment-15232778 ]

Varun Saxena commented on MAPREDUCE-6513:

Fixing the checkstyle issue would require re-indenting surrounding code that otherwise does
not need to change, so I have left it as it is.

Regarding checking the priority for the rescheduled event: the priority is set in
RMContainerAllocator, and TestMRApp uses a custom allocator, so we cannot check it there.
We can, however, check the ContainerRequestEvent and see whether the flag indicating that an
earlier map task attempt failed is set. If it is set, RMContainerAllocator will set the
priority of the next map task to 5.
We already have coverage in TestRMContainerAllocator for that part of the flow.
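
To illustrate the kind of check described above, here is a minimal, self-contained sketch. It
does not use the real Hadoop classes: RequestEvent and RecordingAllocator are hypothetical
stand-ins for ContainerRequestEvent and the custom test allocator, and the value 20 for a
first-attempt map is an assumption made only for contrast. Only the value 5 comes from the
comment above.

```java
import java.util.ArrayList;
import java.util.List;

class FastFailPrioritySketch {
    // 5 is the priority mentioned in the comment above.
    static final int PRIORITY_FAST_FAIL_MAP = 5;
    // Assumed value for a first-attempt map, used only for contrast in this sketch.
    static final int ASSUMED_PRIORITY_MAP = 20;

    /** Hypothetical stand-in for ContainerRequestEvent; only the flag matters here. */
    static final class RequestEvent {
        final String attemptId;
        final boolean earlierAttemptFailed;

        RequestEvent(String attemptId, boolean earlierAttemptFailed) {
            this.attemptId = attemptId;
            this.earlierAttemptFailed = earlierAttemptFailed;
        }
    }

    /** Hypothetical stand-in for a custom test allocator: it only records events. */
    static final class RecordingAllocator {
        final List<RequestEvent> events = new ArrayList<>();

        void handle(RequestEvent event) {
            events.add(event);
        }
    }

    /** Mirrors the rule described above: a set flag leads to priority 5. */
    static int priorityFor(RequestEvent event) {
        return event.earlierAttemptFailed ? PRIORITY_FAST_FAIL_MAP : ASSUMED_PRIORITY_MAP;
    }

    public static void main(String[] args) {
        RecordingAllocator allocator = new RecordingAllocator();
        allocator.handle(new RequestEvent("map_attempt_0", false)); // first attempt
        allocator.handle(new RequestEvent("map_attempt_1", true));  // rescheduled attempt

        // The test can only inspect the recorded event, not the priority itself.
        RequestEvent rescheduled = allocator.events.get(1);
        if (!rescheduled.earlierAttemptFailed) {
            throw new AssertionError("expected the earlier-attempt-failed flag to be set");
        }
        System.out.println("priority for rescheduled map: " + priorityFor(rescheduled));
    }
}
```

The point of the sketch is the split in responsibility: a test with a recording allocator can
assert on the flag carried by the event, while the mapping from flag to priority has to be
covered where the real allocator runs.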

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob.zhao
>            Assignee: Varun Saxena
>            Priority: Critical
>         Attachments: MAPREDUCE-6513.01.patch, MAPREDUCE-6513.02.patch, MAPREDUCE-6513.03.patch
> While a job with many tasks was in progress, one node became unstable due to an OS
> issue. After the node became unstable, the maps on this node changed to the KILLED state.
>
> Currently, the maps that were running on the unstable node are rescheduled; all are in the
> scheduled state, waiting for the RM to assign containers. Ask requests for maps were seen
> until the node was good (all of those failed); there are no ask requests after this.
> But the AM keeps on preempting the reducers (it's
> Finally, the reducers are waiting for the mappers to complete, and the mappers did not get containers.
> My question is:
> ============
> Why were map requests not sent by the AM once the node recovered?

