hadoop-mapreduce-issues mailing list archives

From "Ankit Malhotra (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-6190) MR Job is stuck because of one mapper stuck in STARTING
Date Wed, 10 Dec 2014 20:30:12 GMT
Ankit Malhotra created MAPREDUCE-6190:
-----------------------------------------

             Summary: MR Job is stuck because of one mapper stuck in STARTING
                 Key: MAPREDUCE-6190
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6190
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Ankit Malhotra


We are trying to understand an intermittent issue that we started seeing on our CDH5.1.0 cluster
with MapReduce jobs on YARN.

We had a job stuck for hours because one of its mappers never fully started. The map task
had two attempts: the first one failed, the AM tried to schedule a second one, and that second
attempt stayed in STATE: STARTING, STATUS: NEW. No node was ever assigned to it, so the task,
and the job along with it, was stuck indefinitely.

The AM log repeated the following entries over and over:

{code}
2014-12-09 19:25:12,347 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down 0
2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received completed container container_1408745633994_450952_02_003807
2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Reduce preemption successful attempt_1408745633994_450952_r_000048_1000
2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down all scheduled reduces:0
2014-12-09 19:25:13,352 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Going to preempt 1
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Preempting attempt_1408745633994_450952_r_000050_1000
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=0
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: completedMapPercent 0.99968 totalMemLimit:1722880 finalMapMemLimit:2560 finalReduceMemLimit:1720320 netScheduledMapMem:2560 netScheduledReduceMem:1722880
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Ramping down 0
2014-12-09 19:25:13,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: After Scheduling: PendingReds:77 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:673 CompletedMaps:3124 CompletedReds:0 ContAlloc:4789 ContRel:798 HostLocal:2944 RackLocal:155
2014-12-09 19:25:14,353 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:78 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 AssignedReds:673 CompletedMaps:3124 CompletedReds:0 ContAlloc:4789 ContRel:798 HostLocal:2944 RackLocal:155
2014-12-09 19:25:14,359 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Recalculating schedule, headroom=0
{code}
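
As a sanity check on the numbers in the log above, here is a minimal Python sketch that parses the two counter entries and recomputes the figures. It is not taken from Hadoop: the helper `parse_counters` and the variable names are hypothetical, and the per-reducer container size is only inferred by dividing netScheduledReduceMem by AssignedReds. It checks the arithmetic and does not explain the cause.

```python
import re

# Two entries copied from the AM log above (timestamp and logger prefix removed).
MEM_LINE = (
    "completedMapPercent 0.99968 totalMemLimit:1722880 finalMapMemLimit:2560 "
    "finalReduceMemLimit:1720320 netScheduledMapMem:2560 netScheduledReduceMem:1722880"
)
COUNTER_LINE = (
    "After Scheduling: PendingReds:77 ScheduledMaps:1 ScheduledReds:0 AssignedMaps:0 "
    "AssignedReds:673 CompletedMaps:3124 CompletedReds:0 ContAlloc:4789 ContRel:798 "
    "HostLocal:2944 RackLocal:155"
)


def parse_counters(line):
    """Return every name:integer pair found in a log line as a dict."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", line)}


mem = parse_counters(MEM_LINE)
counters = parse_counters(COUNTER_LINE)

# Maps that are completed, waiting to be scheduled, or currently assigned.
total_maps = (
    counters["CompletedMaps"] + counters["ScheduledMaps"] + counters["AssignedMaps"]
)
completed_fraction = counters["CompletedMaps"] / total_maps

# Inferred, not logged: memory per assigned reducer container.
mem_per_reducer = mem["netScheduledReduceMem"] // counters["AssignedReds"]

# Memory left under the limit once the reducers' share is subtracted.
mem_left_for_maps = mem["totalMemLimit"] - mem["netScheduledReduceMem"]

print("total maps:", total_maps)
print("completed map fraction:", round(completed_fraction, 5))
print("inferred memory per reducer:", mem_per_reducer)
print("memory left for the pending map:", mem_left_for_maps)
```

Under these assumptions, the 673 assigned reducers account for the whole totalMemLimit, which is consistent with headroom=0 and with the single scheduled map never receiving a container.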

After we killed the stuck task manually, the AM started the task again, scheduled it, and ran
it successfully, which completed the task and, with it, the job.

Some quick code grepping led us here:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-app/2.3.0/org/apache/hadoop/mapreduce/v2/app/rm/RMContainerAllocator.java#397

But we still don't quite understand why this happens only once in a while, or why the job
suddenly recovers once the stuck task is manually killed.

Note: Other jobs succeed on the cluster while this job is stuck.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
