hadoop-mapreduce-issues mailing list archives

From "Eric Payne (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3186) User jobs are getting hanged if the Resource manager process goes down and comes up while job is getting executed.
Date Fri, 21 Oct 2011 17:28:32 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132849#comment-13132849 ]

Eric Payne commented on MAPREDUCE-3186:
---------------------------------------

Sid wrote:
> AM talking to the RM – the AM currently logs an exception and
> continues (RMCommunicator.startAllocatorThread()). This should
> be fixed in the MRAppMaster based on the kind of exception
> (temporary timeout versus some kind of a kill response from the RM).
 
If I'm debugging this correctly, the RMCommunicator (via the startAllocatorThread() method)
sends a heartbeat to the RM. While the RM is down, this heartbeat catches an
UndeclaredThrowableException caused by a ConnectException. The RMCommunicator sends the
heartbeat about every second (depending on the config option), and this exception is thrown
during each heartbeat as long as the RM is down. When the RM comes back up, however,
exceptions stop being thrown altogether.
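
To illustrate the distinction Sid raises (a temporary connection failure versus something the AM should act on), here is a minimal sketch of how a caught exception could be classified by walking its cause chain. The class and method names (HeartbeatFailureSketch, isConnectFailure) are hypothetical and are not part of the MRv2 code; only UndeclaredThrowableException and ConnectException are real JDK types.

```java
import java.lang.reflect.UndeclaredThrowableException;
import java.net.ConnectException;

public class HeartbeatFailureSketch {

    /**
     * Hypothetical helper: returns true if the exception, or anything in its
     * cause chain, is a ConnectException (i.e. the RM is likely just unreachable).
     */
    static boolean isConnectFailure(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof ConnectException) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulates what the heartbeat sees while the RM is down:
        // a ConnectException wrapped in an UndeclaredThrowableException.
        Throwable whileRmDown =
            new UndeclaredThrowableException(new ConnectException("Connection refused"));
        System.out.println("connect failure: " + isConnectFailure(whileRmDown));

        // An unrelated failure is not treated as a temporary connection problem.
        Throwable other = new IllegalStateException("something else");
        System.out.println("connect failure: " + isConnectFailure(other));
    }
}
```

A heartbeat loop could keep retrying when this check returns true and escalate otherwise; whether that is the right policy for the AM is an open question here.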
 
I'm still investigating to see why no exception is thrown.
 
It seems that the "right" thing is for this communication mechanism between the RM and the AM
to recognize that the AM is no longer valid and throw the appropriate exception so that the
AM can exit cleanly.

It looks like when the "rogue" MRAM contacts the RM, the RM is telling the AM to reboot, but
the MRAM is ignoring it.

I would say that on the MRAM side, RMContainerAllocator.getResource() calls
RMContainerRequestor.makeRemoteRequest() to get the response from the RM. At that point,
RMContainerAllocator.getResource() should check the reboot flag in the response and throw
an exception, which should cause the RMCommunicator thread to exit.
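
A rough sketch of the proposed check follows. It does not use the real Hadoop classes: AllocateResponse, RebootRequestedException, and runHeartbeats are hypothetical stand-ins, and the Supplier plays the role of makeRemoteRequest(). The point is only to show the shape of "inspect the reboot flag, throw, and let the heartbeat loop exit".

```java
import java.util.function.Supplier;

public class RebootCheckSketch {

    /** Hypothetical stand-in for the response the RM returns on each heartbeat. */
    static final class AllocateResponse {
        private final boolean reboot;
        AllocateResponse(boolean reboot) { this.reboot = reboot; }
        boolean getReboot() { return reboot; }
    }

    /** Hypothetical exception meaning the RM no longer recognizes this AM. */
    static final class RebootRequestedException extends RuntimeException {
        RebootRequestedException(String message) { super(message); }
    }

    /** Sketch of the proposed getResource(): fetch the response, check the flag, throw. */
    static AllocateResponse getResource(Supplier<AllocateResponse> makeRemoteRequest) {
        AllocateResponse response = makeRemoteRequest.get();
        if (response.getReboot()) {
            throw new RebootRequestedException("RM asked this AM to reboot");
        }
        return response;
    }

    /** Heartbeat loop that stops on a reboot request; returns the successful heartbeat count. */
    static int runHeartbeats(Supplier<AllocateResponse> rm, int maxHeartbeats) {
        int successful = 0;
        for (int i = 0; i < maxHeartbeats; i++) {
            try {
                getResource(rm);
                successful++;
            } catch (RebootRequestedException e) {
                // Instead of logging and continuing, leave the loop so the AM can exit.
                break;
            }
        }
        return successful;
    }

    public static void main(String[] args) {
        // Simulated RM: two normal responses, then a reboot request after an RM restart.
        int[] calls = {0};
        Supplier<AllocateResponse> rm = () -> new AllocateResponse(++calls[0] > 2);
        System.out.println("heartbeats before exit: " + runHeartbeats(rm, 10));
    }
}
```

In the real AM, leaving the loop would presumably need to be followed by whatever cleanup and shutdown path the MRAppMaster uses; that part is not sketched here.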


                
> User jobs are getting hanged if the Resource manager process goes down and comes up while job is getting executed.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3186
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3186
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>         Environment: linux
>            Reporter: Ramgopal N
>            Assignee: Eric Payne
>              Labels: test
>
> If the resource manager is restarted while the job execution is in progress, the job is getting hanged.
> UI shows the job as running.
> In the RM log, it is throwing an error "ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AppAttemptId doesnt exist in cache appattempt_1318579738195_0004_000001"
> In the console MRAppMaster and Runjar processes are not getting killed

