hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Varun Saxena (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time
Date Fri, 16 Oct 2015 10:12:05 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960468#comment-14960468
] 

Varun Saxena commented on MAPREDUCE-6513:
-----------------------------------------

Took logs for analysis from Bob offline. The scenario is as under :
1. All the maps have completed.
{panel}
2015-10-13 04:38:42,229 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
*After Scheduling:* PendingReds:0 {color:red}ScheduledMaps:0{color} ScheduledReds:651 {color:red}AssignedMaps:0{color}
AssignedReds:0 {color:red}CompletedMaps:78{color} CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64
RackLocal:14
{panel}
2. One node becomes unstable and hence some of the succeeded map tasks which ran on that node
are killed
{noformat}
2015-10-13 04:53:41,127 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000077_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000026_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000007_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000034_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000015_0
2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl:
TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000036_0
{noformat}

3. As can be seen below 16 maps are now scheduled
{panel}
2015-10-13 04:53:42,128 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
*Before Scheduling:* PendingReds:0 {color:red}ScheduledMaps:16{color} ScheduledReds:651 {color:red}AssignedMaps:0{color}
AssignedReds:0 {color:red}CompletedMaps:62{color} CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64
RackLocal:14
{panel}

4. Node comes back up again after a while.

5. After this we keep on seeing that reducers keep on getting preempted, scheduled and this
goes on and on in a cycle. And mappers are never assigned(due to lower priority).
{noformat}
2015-10-13 04:38:40,219 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Before Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:2 AssignedReds:0
CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14
2015-10-13 04:38:40,223 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:1 AssignedReds:0
CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14
2015-10-13 04:38:42,229 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:0 AssignedReds:0
CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14
2015-10-13 04:53:42,128 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Before Scheduling: PendingReds:0 ScheduledMaps:16 ScheduledReds:651 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14
2015-10-13 04:53:42,132 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14
2015-10-13 04:54:49,433 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:84 ContRel:6 HostLocal:64 RackLocal:14
2015-10-13 04:54:50,451 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:90 ContRel:12 HostLocal:64 RackLocal:14
2015-10-13 04:54:51,470 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:95 ContRel:17 HostLocal:64 RackLocal:14
2015-10-13 04:54:52,501 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:114 ContRel:36 HostLocal:64 RackLocal:14
2015-10-13 04:54:53,553 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:129 ContRel:51 HostLocal:64 RackLocal:14
2015-10-13 04:54:54,657 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:147 ContRel:69 HostLocal:64 RackLocal:14
2015-10-13 04:54:55,708 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:155 ContRel:77 HostLocal:64 RackLocal:14
.......

2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:641 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:0
CompletedMaps:62 CompletedReds:0 ContAlloc:216 ContRel:138 HostLocal:64 RackLocal:14
2015-10-13 04:55:05,923 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:636 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:5
CompletedMaps:62 CompletedReds:0 ContAlloc:221 ContRel:138 HostLocal:64 RackLocal:14
2015-10-13 04:55:06,929 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:631 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:11
CompletedMaps:62 CompletedReds:0 ContAlloc:227 ContRel:138 HostLocal:64 RackLocal:14
2015-10-13 04:55:07,945 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:630 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:17
CompletedMaps:62 CompletedReds:0 ContAlloc:233 ContRel:138 HostLocal:64 RackLocal:14
2015-10-13 04:55:08,967 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:630 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:21
CompletedMaps:62 CompletedReds:0 ContAlloc:238 ContRel:139 HostLocal:64 RackLocal:14
2015-10-13 04:55:09,967 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Before Scheduling: PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:21
CompletedMaps:62 CompletedReds:0 ContAlloc:238 ContRel:139 HostLocal:64 RackLocal:14
2015-10-13 04:55:09,979 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:16
CompletedMaps:62 CompletedReds:0 ContAlloc:253 ContRel:154 HostLocal:64 RackLocal:14
2015-10-13 04:55:11,013 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:10
CompletedMaps:62 CompletedReds:0 ContAlloc:260 ContRel:161 HostLocal:64 RackLocal:14
2015-10-13 04:55:12,013 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Before Scheduling: PendingReds:646 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:10
CompletedMaps:62 CompletedReds:0 ContAlloc:260 ContRel:161 HostLocal:64 RackLocal:14
2015-10-13 04:55:12,031 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:636 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:8
CompletedMaps:62 CompletedReds:0 ContAlloc:267 ContRel:168 HostLocal:64 RackLocal:14
2015-10-13 04:55:13,053 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:624 ScheduledMaps:16 ScheduledReds:15 AssignedMaps:0 AssignedReds:12
CompletedMaps:62 CompletedReds:0 ContAlloc:274 ContRel:168 HostLocal:64 RackLocal:14
2015-10-13 04:55:14,061 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:614 ScheduledMaps:16 ScheduledReds:18 AssignedMaps:0 AssignedReds:19
CompletedMaps:62 CompletedReds:0 ContAlloc:281 ContRel:168 HostLocal:64 RackLocal:14
....

2015-10-13 04:58:18,813 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Before Scheduling: PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:44
CompletedMaps:62 CompletedReds:0 ContAlloc:1372 ContRel:964 HostLocal:64 RackLocal:14
2015-10-13 04:58:18,830 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:43
CompletedMaps:62 CompletedReds:0 ContAlloc:1386 ContRel:978 HostLocal:64 RackLocal:14
2015-10-13 04:58:19,855 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:615 ScheduledMaps:16 ScheduledReds:8 AssignedMaps:0 AssignedReds:32
CompletedMaps:62 CompletedReds:0 ContAlloc:1394 ContRel:986 HostLocal:64 RackLocal:14
2015-10-13 04:58:20,877 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:614 ScheduledMaps:16 ScheduledReds:3 AssignedMaps:0 AssignedReds:38
CompletedMaps:62 CompletedReds:0 ContAlloc:1400 ContRel:986 HostLocal:64 RackLocal:14
2015-10-13 04:58:21,890 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:605 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:38
CompletedMaps:62 CompletedReds:0 ContAlloc:1405 ContRel:988 HostLocal:64 RackLocal:14
2015-10-13 04:58:22,897 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:605 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:43
CompletedMaps:62 CompletedReds:0 ContAlloc:1410 ContRel:988 HostLocal:64 RackLocal:14
...

2015-10-13 04:58:18,813 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Before Scheduling: PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:44
CompletedMaps:62 CompletedReds:0 ContAlloc:1372 ContRel:964 HostLocal:64 RackLocal:14
2015-10-13 04:58:18,830 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:43
CompletedMaps:62 CompletedReds:0 ContAlloc:1386 ContRel:978 HostLocal:64 RackLocal:14
2015-10-13 04:58:19,855 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:615 ScheduledMaps:16 ScheduledReds:8 AssignedMaps:0 AssignedReds:32
CompletedMaps:62 CompletedReds:0 ContAlloc:1394 ContRel:986 HostLocal:64 RackLocal:14
2015-10-13 04:58:20,877 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:614 ScheduledMaps:16 ScheduledReds:3 AssignedMaps:0 AssignedReds:38
CompletedMaps:62 CompletedReds:0 ContAlloc:1400 ContRel:986 HostLocal:64 RackLocal:14
2015-10-13 04:58:21,890 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:605 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:38
CompletedMaps:62 CompletedReds:0 ContAlloc:1405 ContRel:988 HostLocal:64 RackLocal:14
2015-10-13 04:58:22,897 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
After Scheduling: PendingReds:605 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:43
CompletedMaps:62 CompletedReds:0 ContAlloc:1410 ContRel:988 HostLocal:64 RackLocal:14
{noformat}

                                                                                         
                                      

> MR job got hanged forever when one NM unstable for some time
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6513
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Bob
>            Assignee: Varun Saxena
>            Priority: Critical
>
> when job is in-progress which is having more tasks,one node became unstable due to some
OS issue.After the node became unstable, the map on this node status changed to KILLED state.

> Currently maps which were running on unstable node are rescheduled, and all are in scheduled
state and wait for RM assign container.Seen ask requests for map till Node is good (all those
failed), there are no ask request after this. But AM keeps on preempting the reducers (it's
recycling).
> Finally reducers are waiting for complete mappers and mappers did n't get container..
> My Question Is:
> ============
> why map requests did not sent AM ,once after node recovery.?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message