hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6329) Failure of start map task on NM cause job hang
Date Wed, 22 Apr 2015 18:23:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507595#comment-14507595 ]

Jason Lowe commented on MAPREDUCE-6329:
---------------------------------------

The RM log shows the two map containers being allocated, container 3 terminating, then container
4 being allocated.  All of this seems normal with the map task failing and the AM requesting
a new container.  However, this is the interesting part of the RM log:
{noformat}
2015-04-20,21:36:38,633 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1428390739155_23973_01_000004 Container Transitioned from ALLOCATED to KILLED
2015-04-20,21:36:38,633 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt:
Completed container: container_1428390739155_23973_01_000004 in state: KILLED event:KILL
{noformat}

Note that the container was allocated yet killed before it was ACQUIRED.  That means the container
was never received by the AM.  That's why the AM was confused about receiving the completed
container -- it had never seen the container allocated in the first place.  So the next question:
is there anything in the RM log indicating why the container transitioned from ALLOCATED to
KILLED?  Was it preempted or...?
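
To find other containers that hit the same transition, the RM log can be scanned mechanically. The following is a minimal sketch, not an existing Hadoop tool: it assumes transition messages have the shape seen in the excerpt above ("<container id> Container Transitioned from <OLD> to <NEW>"), and the function name `killed_before_acquired` is made up for illustration.

```python
import re
from typing import Iterable, List

# Assumed message shape, copied from the RM log excerpt:
#   "<container id> Container Transitioned from <OLD> to <NEW>"
TRANSITION = re.compile(
    r"(container_\S+) Container Transitioned from (\w+) to (\w+)"
)


def killed_before_acquired(lines: Iterable[str]) -> List[str]:
    """Return ids of containers that went straight from ALLOCATED to KILLED.

    Such containers were never ACQUIRED, so the AM never received them.
    """
    flagged: List[str] = []
    for line in lines:
        match = TRANSITION.search(line)
        if match is None:
            continue  # not a container state transition message
        container_id, old_state, new_state = match.groups()
        if old_state == "ALLOCATED" and new_state == "KILLED":
            flagged.append(container_id)
    return flagged


if __name__ == "__main__":
    import sys

    # Usage: python scan_rm_log.py < resourcemanager.log
    for container_id in killed_before_acquired(sys.stdin):
        print(container_id)
```

The log lines surrounding each flagged container id are then the place to look for the reason behind the kill.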

This seems like a bug in YARN.  The RM is telling the AM a container completed that it never
told the AM about before.  The completion info doesn't tell the AM enough to know, in the
general case, which of its requests this could correspond to and therefore which one it would
need to re-request if it still needs it.  If a container is killed before it is ACQUIRED then
the RM should not treat the corresponding ask for that container as being fulfilled.
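
The suggested behavior can be illustrated with a toy model. This is only a sketch of the idea, not YARN scheduler code: the class `ToyAskTable`, its methods, and the priority-keyed counter are all hypothetical and far simpler than the RM's real bookkeeping. The point it shows is that a container killed while still ALLOCATED gives its ask back, whereas one killed after being ACQUIRED does not.

```python
from collections import Counter


class ToyAskTable:
    """Hypothetical model of outstanding container asks, keyed by priority."""

    def __init__(self):
        self.outstanding = Counter()  # priority -> containers still wanted
        self.containers = {}          # container id -> (priority, state)

    def ask(self, priority, count=1):
        """The AM requests `count` more containers at `priority`."""
        self.outstanding[priority] += count

    def allocate(self, container_id, priority):
        """The scheduler allocates a container, tentatively consuming one ask."""
        if self.outstanding[priority] <= 0:
            raise ValueError("no outstanding ask at priority %r" % priority)
        self.outstanding[priority] -= 1
        self.containers[container_id] = (priority, "ALLOCATED")

    def acquire(self, container_id):
        """The AM picks up the container; the ask is now truly fulfilled."""
        priority, state = self.containers[container_id]
        if state != "ALLOCATED":
            raise ValueError("cannot acquire a container in state " + state)
        self.containers[container_id] = (priority, "ACQUIRED")

    def kill(self, container_id):
        """Kill a container, restoring its ask if the AM never received it."""
        priority, state = self.containers[container_id]
        if state == "ALLOCATED":
            # Never handed to the AM, so the ask was not really fulfilled.
            self.outstanding[priority] += 1
        self.containers[container_id] = (priority, "KILLED")


if __name__ == "__main__":
    table = ToyAskTable()
    table.ask(priority=20)
    table.allocate("container_04", priority=20)
    table.kill("container_04")  # killed before ACQUIRED
    print(table.outstanding[20])
```

With the ask restored, the scheduler would allocate a replacement on its own, and the AM would not need to guess which request the unknown completed container belonged to.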

> Failure of start map task on NM cause job hang
> ----------------------------------------------
>
>                 Key: MAPREDUCE-6329
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6329
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Peng Zhang
>         Attachments: syslog.tgz, yarn-app.log
>
>
> During a rolling update of the NM, the AM failed to start a container on the NM.
> The job then hung.
> AM logs are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
