hadoop-mapreduce-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4152) map task left hanging after AM dies trying to connect to RM
Date Wed, 02 May 2012 22:20:49 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266960#comment-13266960 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-4152:
----------------------------------------------------

Going back and forth on this one, apologies.

So the situation is that the RM went down somehow and the AM exited without killing its tasks. This
is expected, IIRC. Here's what I think:
 - Once RM restart works, AMs should *never* exit because of connection issues. (Of course,
there is a corner case where the AM's own network is down; we should handle that somehow.)
 - Even in the short term, if the RM goes down and the AM exits in the meantime, then whenever the RM is
back up, it will (should) kill all the containers of this application (by commanding the NMs
to do so).

Given the above, I don't see why the AM needs to handle this specially. Maybe I am missing something?
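
To make the second point concrete, here is a minimal Python sketch of the cleanup a restarted RM could perform. It is a toy model, not Hadoop code: the `NodeManager` and `ResourceManager` classes, the `on_restart` method, and the idea of handing the RM a set of applications whose AM is still alive are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy NM (hypothetical): maps container_id -> application_id."""
    name: str
    containers: dict = field(default_factory=dict)

    def kill_containers_of(self, app_id):
        """Kill every container that belongs to app_id; return the killed ids."""
        doomed = sorted(c for c, a in self.containers.items() if a == app_id)
        for container_id in doomed:
            del self.containers[container_id]
        return doomed


@dataclass
class ResourceManager:
    """Toy RM (hypothetical) that knows the NMs registered with it."""
    node_managers: list

    def on_restart(self, live_apps):
        """Command each NM to kill containers of applications not in live_apps."""
        killed = {}
        for nm in self.node_managers:
            orphaned_apps = set(nm.containers.values()) - set(live_apps)
            for app_id in sorted(orphaned_apps):
                killed.setdefault(nm.name, []).extend(nm.kill_containers_of(app_id))
        return killed


# app_A's AM exited while the RM was down; app_B's AM is still alive.
nm = NodeManager("nm1", {"c1": "app_A", "c2": "app_B", "c3": "app_A"})
rm = ResourceManager([nm])
killed = rm.on_restart(live_apps={"app_B"})
```

In this model the AM needs no special shutdown logic: the leftover containers of `app_A` are removed by the RM once it is back, while `app_B` is untouched.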
                
> map task left hanging after AM dies trying to connect to RM
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-4152
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4152
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.2
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>         Attachments: MAPREDUCE-4152.patch, MAPREDUCE-4152.patch
>
>
> We had an instance where the RM went down for more than an hour.  The application master
> exited with "Could not contact RM after 360000 milliseconds"
> 2012-04-11 10:43:36,040 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1333003059741_15999Job Transitioned from RUNNING to ERROR
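
The behaviour in the report (the AM giving up once the RM has been unreachable for 360000 milliseconds) can be sketched as a heartbeat loop with a deadline. This is only an illustration of the pattern, not the actual AM code: `heartbeat_loop`, `RMUnreachable`, `FakeClock`, and the 60000 ms interval are hypothetical; only the 360000 ms timeout and the error message come from the report above.

```python
class RMUnreachable(Exception):
    """Raised when the RM has not answered within the timeout (hypothetical)."""


def heartbeat_loop(contact_rm, now_ms, sleep_ms, timeout_ms=360000, interval_ms=1000):
    """Heartbeat until the RM has been unreachable for timeout_ms, then give up."""
    last_ok = now_ms()
    while True:
        try:
            contact_rm()
            last_ok = now_ms()  # a successful contact resets the deadline
        except ConnectionError:
            if now_ms() - last_ok >= timeout_ms:
                # Note: nothing here kills running tasks, which mirrors the
                # situation described in the report.
                raise RMUnreachable(
                    f"Could not contact RM after {timeout_ms} milliseconds")
        sleep_ms(interval_ms)


class FakeClock:
    """Deterministic clock so the example finishes instantly."""
    def __init__(self):
        self.now = 0

    def now_ms(self):
        return self.now

    def sleep_ms(self, ms):
        self.now += ms


def always_down():
    raise ConnectionError("RM unreachable")


clock = FakeClock()
try:
    heartbeat_loop(always_down, clock.now_ms, clock.sleep_ms,
                   timeout_ms=360000, interval_ms=60000)
    message = None
except RMUnreachable as exc:
    message = str(exc)
```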

