hadoop-yarn-issues mailing list archives

From "Weiwei Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-9238) We get a wrong attempt by an appAttemptId when AM crash at some point
Date Mon, 28 Jan 2019 14:21:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754039#comment-16754039 ]

Weiwei Yang commented on YARN-9238:
-----------------------------------

Hi [~xiaoheipangzi]

This looks similar to YARN-6959. The fix looks good; just a few small nits:

1. OpportunisticContainerAllocatorAMService

Can we also print a warning when this case happens? For example:
{code:java}
LOG.error("Calling allocate on previous or removed or non existent application attempt "
+ applicationAttemptId);
{code}
2. TestOpportunisticContainerAllocatorAMService#testAMCrashDuringAllocate

The following code appears to be unused and can be removed:
{code:java}
+ final RecordFactory factory = RecordFactoryProvider.getRecordFactory(null);
+ AllocateRequest allReq =
+ (AllocateRequestPBImpl)factory.newRecordInstance(AllocateRequest.class);
+ allReq.setAskList(Arrays.asList(
+ ResourceRequest.newInstance(Priority.UNDEFINED, "a",
+ Resource.newInstance(1, 2), 1, true, "exp",
+ ExecutionTypeRequest.newInstance(ExecutionType.OPPORTUNISTIC, true))));

{code}
Thanks

> We get a wrong attempt  by an appAttemptId when AM crash at some point
> ----------------------------------------------------------------------
>
>                 Key: YARN-9238
>                 URL: https://issues.apache.org/jira/browse/YARN-9238
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Critical
>         Attachments: YARN-9238_1.patch, hadoop-test-resourcemanager-hadoop11.log
>
>
> We have found a data race that can lead to an odd situation.
> See org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate{color:#ff0000}:(code1){color}
> {code:java}
>      // Allocate OPPORTUNISTIC containers.
> 171.  SchedulerApplicationAttempt appAttempt =
> 172.    ((AbstractYarnScheduler)rmContext.getScheduler())
> 173.      .getApplicationAttempt(appAttemptId);
> 174.
> 175.  OpportunisticContainerContext oppCtx =
> 176.  appAttempt.getOpportunisticContainerContext();
> 177.  oppCtx.updateNodeList(getLeastLoadedNodes());
> {code}
> If the current AM (attempt ID appattempt_0) crashes just before code1#171, then
> code1#171~173 still runs and looks up the appAttempt by appattempt_0. The returned
> appAttempt should represent the current AM, but we found that it represents the new AM,
> whose attempt ID is appattempt_1. This appAttempt has not yet initialized its oppCtx,
> so an NPE happens at line code1#177.
> {code:java}
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
> at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
> at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
> So why does the old appAttempt disappear, and why do we get the new appAttempt when we
> look it up with the old appattempt_0?
> We have found the reason. The code below ({color:#ff0000}code2{color}) is the body of
> getApplicationAttempt, which is called at code1#173:
> {code:java}
> 399.  public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
> 400.    SchedulerApplication<T> app = applications.get(
> 401.        applicationAttemptId.getApplicationId());
> 402.    return app == null ? null : app.getCurrentAppAttempt();
> 403.  }
> {code}
> When the old AM crashes, a new AM and a new appAttempt are created, and the app's
> currentAttempt is set to the new appAttempt (see code3). So code2#402 returns the new
> appAttempt.
> If the AM crashes at the head of the allocate function (code1), the bug does not occur
> because of ApplicationDoesNotExistInCacheException. If the AM crashes after code1,
> everything is also fine.
> We should add a check that the returned appAttempt has the same ID as the given ID.
> A patch is coming soon!
> {color:#ff0000}code3{color}
> {code:java}
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplication.setCurrentAppAttempt(T currentAttempt){
>     this.currentAttempt = currentAttempt;
> }
> {code}
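
The lookup behaviour described in code2 and code3, and the proposed ID check, can be illustrated with a small self-contained model. This is only a sketch under stated assumptions, not the actual patch: `Attempt`, `App`, the string IDs, and `getApplicationAttemptChecked` are hypothetical stand-ins for the real YARN scheduler types, and the log message is written to stderr instead of a logger.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AttemptLookupSketch {

  /** Hypothetical stand-in for SchedulerApplicationAttempt. */
  static final class Attempt {
    final String attemptId;
    Attempt(String attemptId) { this.attemptId = attemptId; }
  }

  /** Hypothetical stand-in for SchedulerApplication: it only tracks the current attempt. */
  static final class App {
    private volatile Attempt currentAttempt;
    void setCurrentAppAttempt(Attempt attempt) { this.currentAttempt = attempt; }
    Attempt getCurrentAppAttempt() { return currentAttempt; }
  }

  /** Applications keyed by application ID, as in code2. */
  static final Map<String, App> applications = new ConcurrentHashMap<>();

  /** Mirrors code2: the attempt ID is not compared, so the current attempt is always returned. */
  static Attempt getApplicationAttempt(String appId, String attemptId) {
    App app = applications.get(appId);
    return app == null ? null : app.getCurrentAppAttempt();
  }

  /** Assumed form of the proposed check: reject an attempt whose ID differs from the requested one. */
  static Attempt getApplicationAttemptChecked(String appId, String attemptId) {
    Attempt attempt = getApplicationAttempt(appId, attemptId);
    if (attempt == null || !attempt.attemptId.equals(attemptId)) {
      System.err.println("Calling allocate on previous or removed or non existent"
          + " application attempt " + attemptId);
      return null;
    }
    return attempt;
  }

  public static void main(String[] args) {
    App app = new App();
    applications.put("app_1", app);
    app.setCurrentAppAttempt(new Attempt("appattempt_0"));

    // The old AM crashes and a new attempt replaces it (code3).
    app.setCurrentAppAttempt(new Attempt("appattempt_1"));

    // An in-flight allocate call still carries the old attempt ID.
    Attempt unchecked = getApplicationAttempt("app_1", "appattempt_0");
    System.out.println("unchecked lookup returned: " + unchecked.attemptId);

    Attempt checked = getApplicationAttemptChecked("app_1", "appattempt_0");
    System.out.println("checked lookup returned null: " + (checked == null));
  }
}
```

In this model the unchecked lookup hands back the attempt for appattempt_1 even though appattempt_0 was requested, which corresponds to the situation that leads to the NPE at code1#177. How the caller should react to a rejected lookup (throwing an exception versus returning early) is left open here.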
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

