hadoop-mapreduce-issues mailing list archives

From "Thomas Graves (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-4152) map task left hanging after AM dies trying to connect to RM
Date Mon, 30 Apr 2012 20:45:49 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Thomas Graves updated MAPREDUCE-4152:

    Status: Open  (was: Patch Available)

Canceling patch to address the comments.

Thanks for the review, Vinod.

From what I could find, the NM doesn't actually kill anything except when told to do so
or when a container goes over its memory limit.  Since the RM was down and the AM had gone
away, there was no one to tell the NM to kill the task.  What might make sense is to have the
NM kill any remaining containers when it is gracefully shutting down and when it is starting
up (to cover the crash case).  It might not make sense for it to kill the containers
immediately, since those containers could be running and finish just fine.  Shutdown is a bit
easier, since the NM knows which containers are running.  Startup is harder, since the NM
then needs to know exactly what was running before it went down and make sure it doesn't kill
something it shouldn't.  In this particular case, when the RM comes back up it tells the NM
to reboot, so the NM would kill the containers at that point.
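
A minimal sketch of the shutdown/startup idea above, to make the difference between the two
cases concrete.  All names here (ContainerCleanupSketch, killRunningOnShutdown,
leftoversToKillOnStartup) are hypothetical and are not the real NodeManager classes; the
sketch also assumes the NM has some record of what was running before it went down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the NM's view of its containers; not actual YARN code.
class ContainerCleanupSketch {
    enum State { RUNNING, DONE, KILLED }

    private final Map<String, State> containers = new LinkedHashMap<>();

    void launched(String id) { containers.put(id, State.RUNNING); }
    void finished(String id) { containers.put(id, State.DONE); }
    State stateOf(String id) { return containers.get(id); }

    // Graceful shutdown: the NM knows what is running, so it kills only those containers.
    List<String> killRunningOnShutdown() {
        List<String> killed = new ArrayList<>();
        for (Map.Entry<String, State> e : containers.entrySet()) {
            if (e.getValue() == State.RUNNING) {
                e.setValue(State.KILLED);
                killed.add(e.getKey());
            }
        }
        return killed;
    }

    // Startup after a crash: kill only containers that were recorded as running
    // before the NM went down AND are still alive now (assumes such a record exists).
    static List<String> leftoversToKillOnStartup(Set<String> recordedRunning,
                                                 Set<String> aliveNow) {
        List<String> toKill = new ArrayList<>();
        for (String id : recordedRunning) {
            if (aliveNow.contains(id)) {
                toKill.add(id);
            }
        }
        return toKill;
    }

    public static void main(String[] args) {
        ContainerCleanupSketch nm = new ContainerCleanupSketch();
        nm.launched("container_1");
        nm.launched("container_2");
        nm.finished("container_1");
        System.out.println("killed on shutdown: " + nm.killRunningOnShutdown());

        Set<String> recorded = new HashSet<>(Arrays.asList("container_2"));
        Set<String> alive = new HashSet<>(Arrays.asList("container_2", "unrelated_process"));
        System.out.println("to kill on startup: " + leftoversToKillOnStartup(recorded, alive));
    }
}
```

The startup case only touches containers that appear in both sets, which is the "make sure
it doesn't kill something it shouldn't" part.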

I think it's a bit more of a corner case, because normally the AM would have killed the task
or the task would have finished normally.  But I will file the JIRAs for those.  Let me know
if you have additional thoughts.

I originally had a check in kill() to make sure it wasn't already done, but had somehow
thought it wasn't needed.  Perhaps I misread something; I will look again.
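
For reference, the kind of guard I mean looks roughly like the sketch below.  This is not the
code in the attached patch; GuardedKillSketch and its methods are hypothetical names, and it
assumes "done" means the task has already finished on its own.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical task wrapper; illustrates a kill() that is a no-op once the task is done.
class GuardedKillSketch {
    enum Status { RUNNING, DONE, KILLED }

    private final AtomicReference<Status> status =
        new AtomicReference<>(Status.RUNNING);

    // Called when the task finishes normally; has no effect if it was already killed.
    void markDone() {
        status.compareAndSet(Status.RUNNING, Status.DONE);
    }

    // Returns true only if this call moved the task from RUNNING to KILLED.
    // A task that is already DONE (or already KILLED) is left untouched.
    boolean kill() {
        return status.compareAndSet(Status.RUNNING, Status.KILLED);
    }

    Status status() {
        return status.get();
    }

    public static void main(String[] args) {
        GuardedKillSketch finished = new GuardedKillSketch();
        finished.markDone();
        System.out.println("kill after done: " + finished.kill());

        GuardedKillSketch hanging = new GuardedKillSketch();
        System.out.println("kill while running: " + hanging.kill());
    }
}
```

The compare-and-set keeps the check and the state change in one step, so a task that finishes
at the same moment cannot be marked as both done and killed.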

I will make the other changes. 
> map task left hanging after AM dies trying to connect to RM
> -----------------------------------------------------------
>                 Key: MAPREDUCE-4152
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4152
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.2
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>         Attachments: MAPREDUCE-4152.patch
> We had an instance where the RM went down for more than an hour.  The application master exited with "Could not contact RM after 360000 milliseconds"
> 2012-04-11 10:43:36,040 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1333003059741_15999Job Transitioned from RUNNING to ERROR
