hadoop-mapreduce-issues mailing list archives

From "Thomas Graves (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4152) map task left hanging after AM dies trying to connect to RM
Date Thu, 03 May 2012 13:04:51 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267415#comment-13267415 ]

Thomas Graves commented on MAPREDUCE-4152:

* I agree, but since RM restart doesn't work yet, the AM currently times out (see
RMContainerAllocator and the conf setting MR_AM_TO_RM_WAIT_INTERVAL_MS), so I think it should
clean up.  Once RM restart is implemented, the AM timeout can possibly be removed and the AM
won't reach the cleanup.  Leaving in the code that kills its tasks on shutdown won't hurt anything.

* The RM does not kill the containers that were running because it doesn't know which containers
were running; on restart it loses all of its state.  Also, when the RM does come back up, it tells
every node manager that heartbeats in to reboot, so the node managers lose their containers as well.
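
To make the first point concrete, here is a minimal sketch of the behaviour being argued for: the AM keeps retrying the RM until a wait interval elapses, then gives up and kills its own tasks on the way out. This is not the actual RMContainerAllocator code. The class, method, and field names are hypothetical, the clock and the RM call are passed in as plain functions so the logic can be exercised without a cluster, and waitIntervalMs only plays the role of the MR_AM_TO_RM_WAIT_INTERVAL_MS setting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/** Hypothetical sketch of an AM that times out on the RM and cleans up its tasks. */
class RmContactTimeoutSketch {

    /** Outcome of a single heartbeat attempt. */
    enum Status { OK, RETRYING, TIMED_OUT }

    // Assumption: stands in for the value of MR_AM_TO_RM_WAIT_INTERVAL_MS.
    private final long waitIntervalMs;
    // Injected clock (milliseconds) so the timeout can be driven from a test.
    private final LongSupplier clock;
    private final List<String> runningTasks = new ArrayList<>();
    private final List<String> killedTasks = new ArrayList<>();
    private long lastContactMs;

    RmContactTimeoutSketch(long waitIntervalMs, LongSupplier clock) {
        this.waitIntervalMs = waitIntervalMs;
        this.clock = clock;
        this.lastContactMs = clock.getAsLong();
    }

    void taskLaunched(String taskId) {
        runningTasks.add(taskId);
    }

    /**
     * Tries to contact the RM once. contactRm returns true on success.
     * Once the RM has been unreachable for waitIntervalMs, the AM shuts
     * down and kills whatever tasks it still has running.
     */
    Status heartbeat(BooleanSupplier contactRm) {
        long now = clock.getAsLong();
        if (contactRm.getAsBoolean()) {
            lastContactMs = now;
            return Status.OK;
        }
        if (now - lastContactMs < waitIntervalMs) {
            return Status.RETRYING;
        }
        shutdown();
        return Status.TIMED_OUT;
    }

    // The cleanup step under discussion: no task is left hanging after the AM exits.
    private void shutdown() {
        killedTasks.addAll(runningTasks);
        runningTasks.clear();
    }

    List<String> runningTasks() {
        return runningTasks;
    }

    List<String> killedTasks() {
        return killedTasks;
    }
}
```

With a 360000 ms interval (the value in the reported error message), a failed heartbeat before the interval elapses returns RETRYING and leaves the tasks alone, while a failed heartbeat at or after it returns TIMED_OUT and moves every running task to the killed list. If RM restart support later removes the timeout, the shutdown path is simply never reached, which is why keeping the kill-on-shutdown step is harmless.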

> map task left hanging after AM dies trying to connect to RM
> -----------------------------------------------------------
>                 Key: MAPREDUCE-4152
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4152
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.2
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>         Attachments: MAPREDUCE-4152.patch, MAPREDUCE-4152.patch
> We had an instance where the RM went down for more than an hour.  The application master exited with "Could not contact RM after 360000 milliseconds"
> 2012-04-11 10:43:36,040 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1333003059741_15999Job Transitioned from RUNNING to ERROR


