hadoop-yarn-issues mailing list archives

From "lujie (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-9238) We get a wrong attempt by an appAttemptId when AM crash at some point
Date Fri, 25 Jan 2019 14:32:00 GMT

     [ https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

lujie updated YARN-9238:
------------------------
    Description: 
We have found a data race that can lead to an unexpected situation.

See org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate{color:#ff0000}:(code1){color}
{code:java}
     // Allocate OPPORTUNISTIC containers.
171.  SchedulerApplicationAttempt appAttempt =
172.    ((AbstractYarnScheduler)rmContext.getScheduler())
173.      .getApplicationAttempt(appAttemptId);
174.
175.  OpportunisticContainerContext oppCtx =
176.  appAttempt.getOpportunisticContainerContext();
177.  oppCtx.updateNodeList(getLeastLoadedNodes());
{code}
If the current AM (attempt id appattempt_0) crashes just before code1#171, then
code1#171~173 still execute and look up the appAttempt by appattempt_0, so the returned
appAttempt should represent the current AM. Instead, we found that the appAttempt represents
the new AM, whose attempt id is appattempt_1. The appAttempt for the new AM has not yet
initialized its oppCtx, so an NPE happens at code1#177.
{code:java}
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
{code}
We have found why a lookup with the old appattempt_0 returns the new appAttempt that
represents the new AM. The code below ({color:#ff0000}code2{color}) is the body of
getApplicationAttempt, which is called at code1#173:
{code:java}
399. public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
400.   SchedulerApplication<T> app = applications.get(
401.       applicationAttemptId.getApplicationId());
402.   return app == null ? null : app.getCurrentAppAttempt();
403. }
{code}
When the old AM crashes, the app's CurrentAppAttempt is set to the new appAttempt
that represents the new AM, so code2#402 returns the new appAttempt.

If the AM crashes before code1, the bug does not happen, because an ApplicationDoesNotExistInCacheException is thrown.
If the AM crashes after code1, everything is also fine.

We should add a check that the returned appAttempt has the same id as the given appAttemptId.
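
The proposed check can be illustrated with a minimal sketch. It is not the attached patch: AttemptLookupSketch, AttemptId and Attempt are hypothetical, simplified stand-ins for the scheduler types, and returning null on a mismatch is only one possible way to signal it. getCurrentAttempt mirrors code2 (lookup by application id only), while getAttemptChecked also compares the attempt id.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified model of the lookup; all types here are hypothetical stand-ins, not YARN classes. */
class AttemptLookupSketch {

  /** Stand-in for ApplicationAttemptId: an application id plus an attempt number. */
  static final class AttemptId {
    final String applicationId;
    final int attemptNumber;

    AttemptId(String applicationId, int attemptNumber) {
      this.applicationId = applicationId;
      this.attemptNumber = attemptNumber;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof AttemptId)) {
        return false;
      }
      AttemptId other = (AttemptId) o;
      return attemptNumber == other.attemptNumber
          && applicationId.equals(other.applicationId);
    }

    @Override
    public int hashCode() {
      return 31 * applicationId.hashCode() + attemptNumber;
    }
  }

  /** Stand-in for SchedulerApplicationAttempt; it only carries its own id. */
  static final class Attempt {
    final AttemptId id;

    Attempt(AttemptId id) {
      this.id = id;
    }
  }

  // Application id -> current attempt, modelling applications.get(...).getCurrentAppAttempt().
  private final Map<String, Attempt> currentAttempts = new ConcurrentHashMap<>();

  /** Models the switch to a new attempt after the old AM crashes. */
  void setCurrentAttempt(Attempt attempt) {
    currentAttempts.put(attempt.id.applicationId, attempt);
  }

  /** Mirrors code2: the attempt number of the requested id is ignored. */
  Attempt getCurrentAttempt(AttemptId requested) {
    return currentAttempts.get(requested.applicationId);
  }

  /** Guarded lookup: returns null unless the current attempt has exactly the requested id. */
  Attempt getAttemptChecked(AttemptId requested) {
    Attempt current = currentAttempts.get(requested.applicationId);
    if (current == null || !current.id.equals(requested)) {
      return null;
    }
    return current;
  }
}
```

With such a guard, a caller like code1 would still need to handle the null result (for example by rejecting the stale allocate call) before dereferencing the attempt.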

A patch is coming soon!

 



> We get a wrong attempt  by an appAttemptId when AM crash at some point
> ----------------------------------------------------------------------
>
>                 Key: YARN-9238
>                 URL: https://issues.apache.org/jira/browse/YARN-9238
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Critical
>         Attachments: YARN-9238_1.patch, hadoop-test-resourcemanager-hadoop11.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

