hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
Date Fri, 15 May 2015 23:30:02 GMT

    [ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546377#comment-14546377 ]

Bikas Saha commented on YARN-1902:
----------------------------------

The AMRMClient was not written to automatically remove requests because it does not know which
requests will be matched to allocated containers. The explicit contract is for users of AMRMClient
to remove requests that have been matched to containers.
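
For illustration, a minimal sketch of that contract with the synchronous client (the host name,
resource sizes, and surrounding flow below are hypothetical, not taken from this issue):

{code:java}
import java.util.List;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// The AMRMClient itself never removes a request; the caller does, once a
// container has been matched to it.
AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
amClient.init(new YarnConfiguration());
amClient.start();
amClient.registerApplicationMaster("appHost", 0, ""); // placeholder host/port/URL

Resource capability = Resource.newInstance(1024, 1); // 1 GB, 1 vcore (example sizing)
ContainerRequest request =
    new ContainerRequest(capability, null, null, Priority.newInstance(0));
amClient.addContainerRequest(request);

AllocateResponse response = amClient.allocate(0.0f); // heartbeat to the RM
List<Container> allocated = response.getAllocatedContainers();
for (Container container : allocated) {
  // The explicit contract: remove the request this container satisfies,
  // so it is not sent to the RM again on the next heartbeat.
  amClient.removeContainerRequest(request);
  // ...launch the container...
}
{code}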

If we change that behavior so that requests are removed automatically, then two entities would be
removing requests: 1) the user and 2) the AMRMClient. That change should therefore only be made
in a different version of AMRMClient, or else existing users will break.

In the worst case, if the AMRMClient automatically removes the wrong request, then the application
will hang because the RM will never give it the container it actually needs. Not automatically
removing the request has the downside of excess containers that the application must release.
We chose excess containers over hanging for the original implementation.
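
Sketched as a helper (the method and its bookkeeping are hypothetical; removeContainerRequest and
releaseAssignedContainer are the real AMRMClient calls), the excess-container side of that
tradeoff looks roughly like this:

{code:java}
// Hypothetical helper: launch what is still needed, release the rest.
void handleAllocated(AMRMClient<ContainerRequest> amClient,
                     ContainerRequest request,
                     List<Container> allocated,
                     int stillNeeded) {
  for (Container container : allocated) {
    if (stillNeeded > 0) {
      amClient.removeContainerRequest(request); // matched: the caller removes it
      stillNeeded--;
      // ...launch the container...
    } else {
      // Excess container: give it back to the RM. This wastes an allocation,
      // but it can never hang the application the way removing a wrong
      // request could.
      amClient.releaseAssignedContainer(container.getId());
    }
  }
}
{code}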


Excess containers should happen rarely with the synchronous client, because the user controls
when AMRMClient heartbeats to the RM and can heartbeat only after having removed all matched
requests, so that the remote request table reflects the current state of outstanding requests.
There may still be a race condition on the RM side that gives out more containers. Excess
containers can happen more often with AMRMClientAsync, because it heartbeats on a regular
schedule and can send more requests than are really outstanding if a heartbeat goes out before
the user has removed the matched requests.
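
A rough sketch of why the async client widens that window (the handler class and its wiring are
illustrative; the callback signatures are those of the AMRMClientAsync.CallbackHandler interface
in Hadoop 2.x):

{code:java}
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

// The heartbeat thread fires on a fixed interval, so a heartbeat can go out
// after a container was allocated but before this callback has removed the
// matched request, and the request is then sent to the RM a second time.
class Handler implements AMRMClientAsync.CallbackHandler {
  private AMRMClientAsync<ContainerRequest> client; // set after construction
  private final ContainerRequest request;           // illustrative single request

  Handler(ContainerRequest request) { this.request = request; }
  void setClient(AMRMClientAsync<ContainerRequest> client) { this.client = client; }

  @Override
  public void onContainersAllocated(List<Container> containers) {
    for (Container c : containers) {
      client.removeContainerRequest(request); // remove promptly to narrow the race
      // ...launch c...
    }
  }
  @Override public void onContainersCompleted(List<ContainerStatus> statuses) {}
  @Override public void onShutdownRequest() {}
  @Override public void onNodesUpdated(List<NodeReport> updated) {}
  @Override public void onError(Throwable e) {}
  @Override public float getProgress() { return 0.0f; }
}

// Wiring: the client heartbeats every 1000 ms regardless of what the
// handler has or has not removed yet.
Handler handler = new Handler(request);
AMRMClientAsync<ContainerRequest> asyncClient =
    AMRMClientAsync.createAMRMClientAsync(1000, handler);
handler.setClient(asyncClient);
{code}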


> Allocation of too many containers when a second request is done with the same resource capability
> -------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1902
>                 URL: https://issues.apache.org/jira/browse/YARN-1902
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.2.0, 2.3.0, 2.4.0
>            Reporter: Sietse T. Au
>            Assignee: Sietse T. Au
>              Labels: client
>         Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called and at least one of the z allocated containers is started, then if another addContainerRequest call is done and subsequently an allocate call to the RM, (z+1) containers will be allocated, where 1 container is expected.
> Scenario 2:
> No containers are started between the allocate calls. 
> Analyzing debug logs of the AMRMClientImpl, I have found that (z+1) containers are indeed requested in both scenarios, but that the correct behavior is observed only in the second scenario.
> Looking at the implementation, I have found that this (z+1) request is caused by the structure of the remoteRequestsTable. The consequence of Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers received.
> The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.
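
For readers following along, Scenario 1 sketched against the client API, continuing the amClient
sketch in the comment above (z = 2 here; all names and sizes are illustrative):

{code:java}
// Scenario 1 from the description, sketched with z = 2.
Resource y = Resource.newInstance(1024, 1);
ContainerRequest x = new ContainerRequest(y, null, null, Priority.newInstance(0));

amClient.addContainerRequest(x); // z = 2 identical requests
amClient.addContainerRequest(x);
AllocateResponse first = amClient.allocate(0.0f);
// ...at least one of the allocated containers is started...

amClient.addContainerRequest(x); // one more request: 1 more container expected
AllocateResponse second = amClient.allocate(0.0f);
// Reported bug: the ask sent to the RM is (z+1) = 3 again, because the
// Map<Resource, ResourceRequestInfo> entry does not record what was already
// sent to the RM, so 3 containers are allocated where 1 was expected.
{code}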



