hadoop-yarn-issues mailing list archives

From "MENG DING (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1645) ContainerManager implementation to support container resizing
Date Mon, 13 Jul 2015 16:45:07 GMT

    [ https://issues.apache.org/jira/browse/YARN-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624921#comment-14624921
] 

MENG DING commented on YARN-1645:
---------------------------------

Thanks for the review [~jianhe] !

bq. This check should not be needed, because AM should be able to resize an existing container
no matter RM restarted or not.

I have some concerns about this that I would like clarified. According to the
work-preserving RM restart documentation (http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html):

bq. RM recovers its running state by taking advantage of the container statuses sent from all
NMs. NM will not kill the containers when it re-syncs with the restarted RM. It continues
managing the containers and send the container statuses across to RM when it re-registers.
RM reconstructs the container instances and the associated applications’ scheduling status
by absorbing these containers’ information

Consider this scenario:
* RM approves a container resource increase request and sends an increase token to AM. 
* Before AM actually increases the resource on NM, RM crashes and then restarts. Because of
work-preserving recovery, RM reconstructs the container resource from the information
sent by NM, which still reflects the old allocation from before the increase.
* Now AM performs the increase on NM. If NM doesn't reject it, NM will start enforcing
the increased resource for the container. The RM and NM views of the container's resource
allocation are now inconsistent.
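
The scenario above can be sketched as a toy simulation. This is a minimal sketch under stated assumptions, not YARN code: the classes IncreaseToken and NodeManager, the rmIdentifier field, and the method viewsConsistentAfterRestart are all hypothetical, and the sketch assumes the check under discussion amounts to NM rejecting a token issued by an RM instance from before the restart.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these classes are YARN APIs.
final class ResizeAfterRestartSketch {

    /** Hypothetical increase token: target size plus the id of the RM instance that issued it. */
    static final class IncreaseToken {
        final String containerId;
        final int memoryMb;
        final long rmIdentifier;

        IncreaseToken(String containerId, int memoryMb, long rmIdentifier) {
            this.containerId = containerId;
            this.memoryMb = memoryMb;
            this.rmIdentifier = rmIdentifier;
        }
    }

    /** Hypothetical NM: tracks the size it enforces per container and the RM instance it is registered with. */
    static final class NodeManager {
        final Map<String, Integer> enforcedMb = new HashMap<>();
        final boolean rejectTokensFromPreviousRm;
        long currentRmIdentifier;

        NodeManager(boolean rejectTokensFromPreviousRm, long currentRmIdentifier) {
            this.rejectTokensFromPreviousRm = rejectTokensFromPreviousRm;
            this.currentRmIdentifier = currentRmIdentifier;
        }

        /** Applies the increase unless the (assumed) check rejects a token from an earlier RM instance. */
        boolean increase(IncreaseToken token) {
            if (rejectTokensFromPreviousRm && token.rmIdentifier != currentRmIdentifier) {
                return false;
            }
            enforcedMb.put(token.containerId, token.memoryMb);
            return true;
        }
    }

    /** Replays the three steps and reports whether RM and NM agree on the container size afterwards. */
    static boolean viewsConsistentAfterRestart(boolean nmRejectsOldTokens) {
        NodeManager nm = new NodeManager(nmRejectsOldTokens, 1L);
        nm.enforcedMb.put("container_1", 1024);

        // Step 1: RM instance 1 approves an increase to 2048 MB and hands a token to AM.
        IncreaseToken token = new IncreaseToken("container_1", 2048, 1L);

        // Step 2: RM restarts as instance 2 and rebuilds its view from what NM reports (still 1024 MB).
        nm.currentRmIdentifier = 2L;
        Map<String, Integer> rmView = new HashMap<>(nm.enforcedMb);

        // Step 3: AM presents the old token to NM.
        nm.increase(token);

        return rmView.get("container_1").equals(nm.enforcedMb.get("container_1"));
    }

    public static void main(String[] args) {
        System.out.println("NM accepts old token, views consistent: " + viewsConsistentAfterRestart(false));
        System.out.println("NM rejects old token, views consistent: " + viewsConsistentAfterRestart(true));
    }
}
```

In this toy model the views diverge only when NM accepts the token issued before the restart; whether the real recovery path behaves this way is exactly the open question.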

Thoughts?

bq. A lot of code is duplicated between authorizeStartRequest and authorizeResourceIncreaseRequest
- could you refactor the code to share the same code?
Will do.

bq. Portion of the code belongs to YARN-1644 and the patch won't compile.
This is the same situation as with YARN-1449. Everything is intertwined :-( I may need to combine
everything into one big patch to submit for the Jenkins build.


> ContainerManager implementation to support container resizing
> -------------------------------------------------------------
>
>                 Key: YARN-1645
>                 URL: https://issues.apache.org/jira/browse/YARN-1645
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Wangda Tan
>            Assignee: MENG DING
>         Attachments: YARN-1645.1.patch, YARN-1645.2.patch, yarn-1645.1.patch
>
>
> Implementation of ContainerManager for container resize, including:
> 1) ContainerManager resize logic 
> 2) Relevant test cases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
