apex-dev mailing list archives

From "Sanjay M Pujare (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (APEXCORE-777) Application Master may not shutdown due to incorrect numRequestedContainers counting
Date Tue, 15 Aug 2017 18:07:00 GMT

    [ https://issues.apache.org/jira/browse/APEXCORE-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126034#comment-16126034 ]

Sanjay M Pujare edited comment on APEXCORE-777 at 8/15/17 6:06 PM:
-------------------------------------------------------------------

This can be addressed as part of refactor JIRA APEXCORE-771. 

When Yarn returns 2 containers, the code processes the first allocated container and, because
it is obviously not "already allocated", it does not touch the counters and then removes the
outstanding request from the requestedResources map. Later it determines that the container is
no longer needed, so it creates a release request for that container. While processing the
second allocated container, it cannot recognize this as an "already allocated" case because
the request was removed from the requestedResources map, so numRequestedContainers doesn't
get incremented. It again determines that this container is not needed and creates a release
request for it. As a result, numRequestedContainers stays at -1, and that is the problem.
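
The sequence above can be traced with a small, self-contained model. This is a hypothetical
sketch, not the Apex code: the class CounterTrace, the method names, and the priority-keyed
map are assumptions made for illustration; only requestedResources and numRequestedContainers
are names from this discussion. It assumes the counter is decremented in bulk by the number
of allocated containers and incremented once per allocation recognized as "already allocated".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CounterTrace {
  // Hypothetical stand-in for the outstanding-request map, keyed by request priority.
  // The value is true once the request has been marked as satisfied.
  final Map<Integer, Boolean> requestedResources = new HashMap<>();
  int numRequestedContainers = 0;
  int releaseRequests = 0;

  void request(int priority) {
    requestedResources.put(priority, false);
    numRequestedContainers++;
  }

  // Processes one allocate response; each element is the priority of an allocated container.
  void onAllocated(List<Integer> allocatedPriorities, boolean stillNeeded) {
    for (int priority : allocatedPriorities) {
      Boolean satisfied = requestedResources.get(priority);
      if (Boolean.TRUE.equals(satisfied)) {
        // Recognized as "already allocated": compensate for the bulk decrement below.
        // Never reached in this trace, because the entry is removed rather than marked.
        numRequestedContainers++;
        releaseRequests++;
        continue;
      }
      // Removing the entry here loses the information needed for later allocations.
      requestedResources.remove(priority);
      if (!stillNeeded) {
        releaseRequests++;
      }
    }
    // Assumed bulk decrement by the number of allocated containers.
    numRequestedContainers -= allocatedPriorities.size();
  }

  public static void main(String[] args) {
    CounterTrace am = new CounterTrace();
    am.request(1);                               // one outstanding request
    am.onAllocated(Arrays.asList(1, 1), false);  // Yarn returns 2 containers for it
    System.out.println("numRequestedContainers = " + am.numRequestedContainers);
    System.out.println("releaseRequests = " + am.releaseRequests);
  }
}
```

Under these assumptions the run ends with numRequestedContainers = -1 and two release
requests, matching the description above.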

Basically, a request should NEVER be removed from the requestedResources map, so that the code
can recognize "already allocated" cases even in situations such as this. We should have additional
flags/states in the map to denote a request as "removed" (with the reason for the removal) so
that it is possible to match later allocations against these "removed" requests.
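
One way to picture this proposal is a map whose entries carry a state instead of being
deleted. The following is only a sketch under assumptions, not a patch: RequestTracker, the
State enum, Entry, markRemoved and the priority key are all hypothetical names, and the
bulk-decrement accounting is assumed as described earlier in this comment.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RequestTracker {
  enum State { OUTSTANDING, ALLOCATED, REMOVED }

  static final class Entry {
    State state = State.OUTSTANDING;
    String removalReason; // set only when state == REMOVED
  }

  // Entries are never deleted, only moved to another state.
  final Map<Integer, Entry> requestedResources = new HashMap<>();
  int numRequestedContainers = 0;
  int releaseRequests = 0;

  void request(int priority) {
    requestedResources.put(priority, new Entry());
    numRequestedContainers++;
  }

  void markRemoved(int priority, String reason) {
    Entry e = requestedResources.get(priority);
    if (e != null) {
      e.state = State.REMOVED;
      e.removalReason = reason;
    }
  }

  void onAllocated(List<Integer> allocatedPriorities, boolean stillNeeded) {
    for (int priority : allocatedPriorities) {
      Entry e = requestedResources.get(priority);
      if (e == null || e.state != State.OUTSTANDING) {
        // Surplus allocation for an unknown, allocated, or removed request:
        // release it and compensate for the bulk decrement below.
        numRequestedContainers++;
        releaseRequests++;
        continue;
      }
      e.state = State.ALLOCATED; // keep the entry so later allocations can be matched
      if (!stillNeeded) {
        markRemoved(priority, "no longer needed");
        releaseRequests++;
      }
    }
    // Assumed bulk decrement by the number of allocated containers.
    numRequestedContainers -= allocatedPriorities.size();
  }

  public static void main(String[] args) {
    RequestTracker am = new RequestTracker();
    am.request(1);
    am.onAllocated(Arrays.asList(1, 1), false); // Yarn returns 2 containers
    System.out.println("numRequestedContainers = " + am.numRequestedContainers);
  }
}
```

In this model the second container is matched against the "removed" entry, so the counter
ends at 0 instead of -1. The trade-off is that the map only grows, so a real implementation
would need some policy for eventually expiring old entries.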


was (Author: sanjaypujare):
This should be addressed as part of refactor JIRA APEXCORE-771. 

> Application Master may not shutdown due to incorrect numRequestedContainers counting
> ------------------------------------------------------------------------------------
>
>                 Key: APEXCORE-777
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-777
>             Project: Apache Apex Core
>          Issue Type: Bug
>            Reporter: Vlad Rozov
>            Priority: Minor
>
> Consider a scenario where the App Master requests a container from Yarn (numRequestedContainers
> = 1). There are not enough resources and the request times out. My understanding is that the
> App Master will re-request it, but the number of requested containers will not change (one
> newly requested, one removed). Let's assume that, by the time Yarn responds, the App Master
> decides that it does not need any containers. If Yarn responds with one allocated container,
> numRequestedContainers will go to 0 (correct). But Yarn may respond with 2 allocated containers
> if, by the time the App Master sends the second request, it has already allocated a container
> for the original request (the one that timed out), as Yarn does not guarantee that a removed
> request will not be fulfilled (see Yarn doc). In this case, won't numRequestedContainers be -1
> due to the bulk decrement?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
