hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "lujie (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-9238) Allocate on previous or removed or non existent application attempt
Date Fri, 22 Feb 2019 07:32:00 GMT

     [ https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

lujie updated YARN-9238:
------------------------
    Description: 
See org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate
{code:java}
     // Allocate OPPORTUNISTIC containers.
171.  SchedulerApplicationAttempt appAttempt =
172.    ((AbstractYarnScheduler)rmContext.getScheduler())
173.      .getApplicationAttempt(appAttemptId);
174.
175.  OpportunisticContainerContext oppCtx =
176.  appAttempt.getOpportunisticContainerContext();
177.  oppCtx.updateNodeList(getLeastLoadedNodes());
{code}
if "allocate" arrive at line#171 and MRAppmaster crashes, ResourceManager will start the new
appAttempt and do 
{code:java}
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplication.setCurrentAppAttempt(T
currentAttempt){
    this.currentAttempt = currentAttempt;
}{code}
the new appAttmept hasn't init its field OpportunisticContainerContext , hence oopCtx ==null
and null pointer happens at line 177
{code:java}
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830) {code}

  was:
We have found a data race that can make an odd situation.

See org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate{color:#ff0000}:(code1){color}
{code:java}
     // Allocate OPPORTUNISTIC containers.
171.  SchedulerApplicationAttempt appAttempt =
172.    ((AbstractYarnScheduler)rmContext.getScheduler())
173.      .getApplicationAttempt(appAttemptId);
174.
175.  OpportunisticContainerContext oppCtx =
176.  appAttempt.getOpportunisticContainerContext();
177.  oppCtx.updateNodeList(getLeastLoadedNodes());
{code}
if we just crash the current AM(its attemptid is appattempt_0) just before code1#171, when
code1#171~173 continue to execute to get the appAttempt by appattempt_0, the obtained appAttempt 
should represent the  currenct AM. But we found that the obtained appAttempt  represents 
the new AM and its attempid is appattempt_1. This  obtained appAttempt  has not init its oppCtx,
so NPE happnes at line code1#177.
{code:java}
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
{code}
So why old appAttempt  disappeares and  why we use old appattempt_0 but get the new appAttempt

We have found the reason. Below code({color:#ff0000}code2{color}) is the function body of getApplicationAttempt 
at code1#173
{code:java}
399. public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
400   SchedulerApplication<T> app = applications.get(
401      applicationAttemptId.getApplicationId());
402   return app == null ? null : app.getCurrentAppAttempt();
403  }
{code}
when old AM Crash,  new AM and new appAttempt comes.  The currentAttempt of app will
be setted as the new appAttempt (see code3). So the code2 #402 will return the new appAttempt. 

if AM crashes at the head of allocate function(code1), bug won't happens due to ApplicationDoesNotExistInCacheException.
AM crashed after code1, everything is also ok.

We shoud add the check: whether the the getted appAttempt have the same id with given id.

patch comes soon!

{color:#ff0000}code3{color}
{code:java}
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplication.setCurrentAppAttempt(T
currentAttempt){
    this.currentAttempt = currentAttempt;
}
{code}
 


> Allocate on previous or removed or non existent application attempt
> -------------------------------------------------------------------
>
>                 Key: YARN-9238
>                 URL: https://issues.apache.org/jira/browse/YARN-9238
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Critical
>         Attachments: YARN-9238_1.patch, YARN-9238_2.patch, YARN-9238_3.patch, hadoop-test-resourcemanager-hadoop11.log
>
>
> See org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate
> {code:java}
>      // Allocate OPPORTUNISTIC containers.
> 171.  SchedulerApplicationAttempt appAttempt =
> 172.    ((AbstractYarnScheduler)rmContext.getScheduler())
> 173.      .getApplicationAttempt(appAttemptId);
> 174.
> 175.  OpportunisticContainerContext oppCtx =
> 176.  appAttempt.getOpportunisticContainerContext();
> 177.  oppCtx.updateNodeList(getLeastLoadedNodes());
> {code}
> if "allocate" arrive at line#171 and MRAppmaster crashes, ResourceManager will start
the new appAttempt and do 
> {code:java}
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplication.setCurrentAppAttempt(T
currentAttempt){
>     this.currentAttempt = currentAttempt;
> }{code}
> the new appAttmept hasn't init its field OpportunisticContainerContext , hence oopCtx
==null and null pointer happens at line 177
> {code:java}
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
> at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
> at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830) {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org


Mime
View raw message