Return-Path: X-Original-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org Delivered-To: apmail-hadoop-mapreduce-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id A9DAC10AFE for ; Fri, 16 Oct 2015 10:12:05 +0000 (UTC) Received: (qmail 76082 invoked by uid 500); 16 Oct 2015 10:12:05 -0000 Delivered-To: apmail-hadoop-mapreduce-issues-archive@hadoop.apache.org Received: (qmail 76021 invoked by uid 500); 16 Oct 2015 10:12:05 -0000 Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: mapreduce-issues@hadoop.apache.org Delivered-To: mailing list mapreduce-issues@hadoop.apache.org Received: (qmail 76007 invoked by uid 99); 16 Oct 2015 10:12:05 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 16 Oct 2015 10:12:05 +0000 Date: Fri, 16 Oct 2015 10:12:05 +0000 (UTC) From: "Varun Saxena (JIRA)" To: mapreduce-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (MAPREDUCE-6513) MR job got hanged forever when one NM unstable for some time MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/MAPREDUCE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960468#comment-14960468 ] Varun Saxena commented on MAPREDUCE-6513: ----------------------------------------- Took logs for analysis from Bob offline. The scenario is as under : 1. All the maps have completed. {panel} 2015-10-13 04:38:42,229 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: *After Scheduling:* PendingReds:0 {color:red}ScheduledMaps:0{color} ScheduledReds:651 {color:red}AssignedMaps:0{color} AssignedReds:0 {color:red}CompletedMaps:78{color} CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14 {panel} 2. One node becomes unstable and hence some of the succeeded map tasks which ran on that node are killed {noformat} 2015-10-13 04:53:41,127 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000077_0 2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000026_0 2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000007_0 2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000034_0 2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000015_0 2015-10-13 04:53:41,128 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed because it ran on unusable node hdszzdcxdat6g05u06p:26009. AttemptId:attempt_1437451211867_1485_m_000036_0 {noformat} 3. As can be seen below 16 maps are now scheduled {panel} 2015-10-13 04:53:42,128 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: *Before Scheduling:* PendingReds:0 {color:red}ScheduledMaps:16{color} ScheduledReds:651 {color:red}AssignedMaps:0{color} AssignedReds:0 {color:red}CompletedMaps:62{color} CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14 {panel} 4. Node comes back up again after a while. 5. After this we keep on seeing that reducers keep on getting preempted, scheduled and this goes on and on in a cycle. And mappers are never assigned(due to lower priority). {noformat} 2015-10-13 04:38:40,219 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:2 AssignedReds:0 CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14 2015-10-13 04:38:40,223 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:1 AssignedReds:0 CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14 2015-10-13 04:38:42,229 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:0 ScheduledMaps:0 ScheduledReds:651 AssignedMaps:0 AssignedReds:0 CompletedMaps:78 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14 2015-10-13 04:53:42,128 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:16 ScheduledReds:651 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14 2015-10-13 04:53:42,132 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:79 ContRel:1 HostLocal:64 RackLocal:14 2015-10-13 04:54:49,433 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:84 ContRel:6 HostLocal:64 RackLocal:14 2015-10-13 04:54:50,451 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:90 ContRel:12 HostLocal:64 RackLocal:14 2015-10-13 04:54:51,470 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:95 ContRel:17 HostLocal:64 RackLocal:14 2015-10-13 04:54:52,501 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:114 ContRel:36 HostLocal:64 RackLocal:14 2015-10-13 04:54:53,553 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:129 ContRel:51 HostLocal:64 RackLocal:14 2015-10-13 04:54:54,657 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:147 ContRel:69 HostLocal:64 RackLocal:14 2015-10-13 04:54:55,708 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:651 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:155 ContRel:77 HostLocal:64 RackLocal:14 ....... 2015-10-13 04:55:04,912 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:641 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:0 CompletedMaps:62 CompletedReds:0 ContAlloc:216 ContRel:138 HostLocal:64 RackLocal:14 2015-10-13 04:55:05,923 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:636 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:5 CompletedMaps:62 CompletedReds:0 ContAlloc:221 ContRel:138 HostLocal:64 RackLocal:14 2015-10-13 04:55:06,929 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:631 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:11 CompletedMaps:62 CompletedReds:0 ContAlloc:227 ContRel:138 HostLocal:64 RackLocal:14 2015-10-13 04:55:07,945 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:630 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:17 CompletedMaps:62 CompletedReds:0 ContAlloc:233 ContRel:138 HostLocal:64 RackLocal:14 2015-10-13 04:55:08,967 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:630 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:21 CompletedMaps:62 CompletedReds:0 ContAlloc:238 ContRel:139 HostLocal:64 RackLocal:14 2015-10-13 04:55:09,967 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:21 CompletedMaps:62 CompletedReds:0 ContAlloc:238 ContRel:139 HostLocal:64 RackLocal:14 2015-10-13 04:55:09,979 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:16 CompletedMaps:62 CompletedReds:0 ContAlloc:253 ContRel:154 HostLocal:64 RackLocal:14 2015-10-13 04:55:11,013 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:641 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:10 CompletedMaps:62 CompletedReds:0 ContAlloc:260 ContRel:161 HostLocal:64 RackLocal:14 2015-10-13 04:55:12,013 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:646 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:10 CompletedMaps:62 CompletedReds:0 ContAlloc:260 ContRel:161 HostLocal:64 RackLocal:14 2015-10-13 04:55:12,031 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:636 ScheduledMaps:16 ScheduledReds:10 AssignedMaps:0 AssignedReds:8 CompletedMaps:62 CompletedReds:0 ContAlloc:267 ContRel:168 HostLocal:64 RackLocal:14 2015-10-13 04:55:13,053 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:624 ScheduledMaps:16 ScheduledReds:15 AssignedMaps:0 AssignedReds:12 CompletedMaps:62 CompletedReds:0 ContAlloc:274 ContRel:168 HostLocal:64 RackLocal:14 2015-10-13 04:55:14,061 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:614 ScheduledMaps:16 ScheduledReds:18 AssignedMaps:0 AssignedReds:19 CompletedMaps:62 CompletedReds:0 ContAlloc:281 ContRel:168 HostLocal:64 RackLocal:14 .... 2015-10-13 04:58:18,813 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:44 CompletedMaps:62 CompletedReds:0 ContAlloc:1372 ContRel:964 HostLocal:64 RackLocal:14 2015-10-13 04:58:18,830 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:43 CompletedMaps:62 CompletedReds:0 ContAlloc:1386 ContRel:978 HostLocal:64 RackLocal:14 2015-10-13 04:58:19,855 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:615 ScheduledMaps:16 ScheduledReds:8 AssignedMaps:0 AssignedReds:32 CompletedMaps:62 CompletedReds:0 ContAlloc:1394 ContRel:986 HostLocal:64 RackLocal:14 2015-10-13 04:58:20,877 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:614 ScheduledMaps:16 ScheduledReds:3 AssignedMaps:0 AssignedReds:38 CompletedMaps:62 CompletedReds:0 ContAlloc:1400 ContRel:986 HostLocal:64 RackLocal:14 2015-10-13 04:58:21,890 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:605 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:38 CompletedMaps:62 CompletedReds:0 ContAlloc:1405 ContRel:988 HostLocal:64 RackLocal:14 2015-10-13 04:58:22,897 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:605 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:43 CompletedMaps:62 CompletedReds:0 ContAlloc:1410 ContRel:988 HostLocal:64 RackLocal:14 ... 2015-10-13 04:58:18,813 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:44 CompletedMaps:62 CompletedReds:0 ContAlloc:1372 ContRel:964 HostLocal:64 RackLocal:14 2015-10-13 04:58:18,830 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:623 ScheduledMaps:16 ScheduledReds:0 AssignedMaps:0 AssignedReds:43 CompletedMaps:62 CompletedReds:0 ContAlloc:1386 ContRel:978 HostLocal:64 RackLocal:14 2015-10-13 04:58:19,855 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:615 ScheduledMaps:16 ScheduledReds:8 AssignedMaps:0 AssignedReds:32 CompletedMaps:62 CompletedReds:0 ContAlloc:1394 ContRel:986 HostLocal:64 RackLocal:14 2015-10-13 04:58:20,877 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:614 ScheduledMaps:16 ScheduledReds:3 AssignedMaps:0 AssignedReds:38 CompletedMaps:62 CompletedReds:0 ContAlloc:1400 ContRel:986 HostLocal:64 RackLocal:14 2015-10-13 04:58:21,890 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:605 ScheduledMaps:16 ScheduledReds:9 AssignedMaps:0 AssignedReds:38 CompletedMaps:62 CompletedReds:0 ContAlloc:1405 ContRel:988 HostLocal:64 RackLocal:14 2015-10-13 04:58:22,897 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:605 ScheduledMaps:16 ScheduledReds:4 AssignedMaps:0 AssignedReds:43 CompletedMaps:62 CompletedReds:0 ContAlloc:1410 ContRel:988 HostLocal:64 RackLocal:14 {noformat} > MR job got hanged forever when one NM unstable for some time > ------------------------------------------------------------ > > Key: MAPREDUCE-6513 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6513 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: applicationmaster, resourcemanager > Affects Versions: 2.7.0 > Reporter: Bob > Assignee: Varun Saxena > Priority: Critical > > when job is in-progress which is having more tasks,one node became unstable due to some OS issue.After the node became unstable, the map on this node status changed to KILLED state. > Currently maps which were running on unstable node are rescheduled, and all are in scheduled state and wait for RM assign container.Seen ask requests for map till Node is good (all those failed), there are no ask request after this. But AM keeps on preempting the reducers (it's recycling). > Finally reducers are waiting for complete mappers and mappers did n't get container.. > My Question Is: > ============ > why map requests did not sent AM ,once after node recovery.? -- This message was sent by Atlassian JIRA (v6.3.4#6332)