hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6439) AM may fail instead of retrying if RM is restarting/shutting down during the allocate call
Date Tue, 04 Aug 2015 08:59:05 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653302#comment-14653302
] 

zhihai xu commented on MAPREDUCE-6439:
--------------------------------------

Thanks for working on this issue [~adhoot]! Based on the stack trace, it looks like this issue
is because AM(YARN IPC Client) can't differentiate between local YarnRuntimeException and
remote YarnRuntimeException.
It looks like we can either fix it at the client side(AM) or at the serve side(RM).
Your attached patch is a client side fix, which replaces local YarnRuntimeException with a
new exception RMContainerAllocationException. I think the patch will work, but I feel it looks
more like a workaround

It looks like there are several ways to fix it at the serve side(RM).
I can think of the following two server side fixes.
# Similar as Anubhav's first option, Translates YarnRuntimeException to YarnException by calling
RPCUtil#getRemoteException. Maybe
we can do the translations at ApplicationMasterProtocolPBServiceImpl#allocate
{code}
  public AllocateResponseProto allocate(RpcController arg0,
      AllocateRequestProto proto) throws ServiceException {
    AllocateRequestPBImpl request = new AllocateRequestPBImpl(proto);
    try {
      AllocateResponse response = real.allocate(request);
      return ((AllocateResponsePBImpl)response).getProto();
    } catch (YarnException e) {
      throw new ServiceException(e);
    } catch (YarnRuntimeException e) {
      throw new ServiceException(RPCUtil.getRemoteException(e));
    } catch (IOException e) {
      throw new ServiceException(e);
    }
  }
{code}
YARN-635 Rename YarnRemoteException to YarnException and  YarnException to YarnRuntimeException.
So this change looks like more backward compatible and more generalized.

# Don't translate InterruptedException to YarnRuntimeException at AsyncDispatcher. Currently
the InterruptedException is hidden in AsyncDispatcher. The stack trace shows the YarnRuntimeException
from AsyncDispatcher is sent to AM. Because lots of code uses AsyncDispatcher. This change
may not be easy and safe.
{code}
      } catch (InterruptedException e) {
        if (!stopped) {
          LOG.warn("AsyncDispatcher thread interrupted", e);
        }
        // Need to reset drained flag to true if event queue is empty,
        // otherwise dispatcher will hang on stop.
        drained = eventQueue.isEmpty();
        throw new YarnRuntimeException(e);
      }
{code}

[~jlowe] - do you think any of my earlier suggestions are reasonable?

> AM may fail instead of retrying if RM is restarting/shutting down during the allocate
call 
> -------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6439
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6439
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>         Attachments: MAPREDUCE-6439.001.patch
>
>
> We are seeing cases where MR AM gets a YarnRuntimeException thats thrown in RM and gets
sent back to AM causing it to think that it has exhausted the number of retries. Copying the
error which causes the heartbeat thread to quit.
> {noformat}
> 2015-07-25 20:07:27,346 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Error communicating with RM: java.lang.InterruptedException
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:245)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:469)
> 	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 	at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> Caused by: java.lang.InterruptedException
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
> 	at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
> 	at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:240)
> 	... 11 more
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.InterruptedException
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:245)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:469)
> 	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 	at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> Caused by: java.lang.InterruptedException
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
> 	at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
> 	at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:240)
> 	... 11 more
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> 	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> 	at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:107)
> 	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> 	at com.sun.proxy.$Proxy36.allocate(Unknown Source)
> 	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.makeRemoteRequest(RMContainerRequestor.java:188)
> 	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:667)
> 	at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:244)
> 	at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnRuntimeException):
java.lang.InterruptedException
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:245)
> 	at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:469)
> 	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> 	at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> Caused by: java.lang.InterruptedException
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
> 	at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
> 	at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:240)
> 	... 11 more
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1411)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1364)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> 	at com.sun.proxy.$Proxy35.allocate(Unknown Source)
> 	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> 	... 12 more
> {noformat} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message