hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4235) Killing app can lead to inconsistent app status between RM and HS
Date Tue, 08 May 2012 22:49:51 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270908#comment-13270908
] 

Jason Lowe commented on MAPREDUCE-4235:
---------------------------------------

The ApplicationMaster log will have this exception when it shuts down:

{noformat}
2012-05-08 16:19:34,666 ERROR [Thread-1] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Exception while unregistering 
RemoteTrace: 
 at LocalTrace: 
	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: RemoteTrace: 
 at LocalTrace: 
	org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Application doesn't
exist in cache appattempt_1336511902223_0001_000001
	at org.apache.hadoop.yarn.factories.impl.pb.YarnRemoteExceptionFactoryPBImpl.createYarnRemoteException(YarnRemoteExceptionFactoryPBImpl.java:39)
	at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:47)
	at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:222)
	at org.apache.hadoop.yarn.api.impl.pb.service.AMRMProtocolPBServiceImpl.finishApplicationMaster(AMRMProtocolPBServiceImpl.java:69)
	at org.apache.hadoop.yarn.proto.AMRMProtocol$AMRMProtocolService$2.callBlockingMethod(AMRMProtocol.java:85)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:916)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
	at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:124)
	at org.apache.hadoop.yarn.api.impl.pb.client.AMRMProtocolPBClientImpl.finishApplicationMaster(AMRMProtocolPBClientImpl.java:85)
	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.unregister(RMCommunicator.java:190)
	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.stop(RMCommunicator.java:216)
	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.stop(RMContainerAllocator.java:226)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.stop(MRAppMaster.java:668)
	at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
	at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
	at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$MRAppMasterShutdownHook.run(MRAppMaster.java:1036)
	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{noformat}

I believe the following sequence of events leads to the problem:

# AM sees all tasks complete, changes internal job state to SUCCEEDED, triggers job finished
event (which currently waits 5 seconds and enlarges the race window)
# kill client command connects to the AM, sees that the job state != RUNNING, then tells RM
to kill application
# RM fields kill request, transitions app state from RUNNING to KILLED/KILLED and unregisters
app.  Leaves tracking URL unchanged (probably should null it out as it does for AM's that
exit unexpectedly)
# AM starts shutdown, tries to unregister with RM, and RM claims it doesn't know about the
app (because it already unregistered it internally)
# HS reports app status as SUCCEEDED because jhist file shows job completed successfully.

If the RM fields an unregister request for an application that was killed, we may want to
consider updating the application's status and tracking URL based on the unregister request
since it is likely to be more accurate (e.g.: SUCCEEDED instead of KILLED and tracking URL
would point to the history server).
                
> Killing app can lead to inconsistent app status between RM and HS
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-4235
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4235
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3
>            Reporter: Jason Lowe
>
> If a client tries to kill an application that is about to complete, the application states
between the ResourceManager's web UI and the history server can be inconsistent.  When the
problem occurs, the ResourceManager shows the Status/FinalStatus as KILLED/KILLED and the
history link will redirect to a broken link.  The history link still references the ApplicationMaster
which is now missing.  The history server entry will show the application state as SUCCEEDED.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message