hadoop-yarn-issues mailing list archives

From "Arun Suresh (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-7275) NM Statestore cleanup for Container updates
Date Thu, 12 Oct 2017 14:52:00 GMT

[ https://issues.apache.org/jira/browse/YARN-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202047#comment-16202047 ]

Arun Suresh edited comment on YARN-7275 at 10/12/17 2:51 PM:
-------------------------------------------------------------

Thanks for the updated patch [~kartheek]

Couple of comments:
* In the new {{ContainerScheduler::recoverActiveContainer}} method, if the container is running, you need to update the utilization tracker via {{this.utilizationTracker.addContainerResources(..)}} (see the first sketch after this list).
* After the recovery process is complete on the NM, we need to consider the following:
** It is possible that just before the NM went down, some of the queued containers were in the process of being started or resumed (the container would be in the RESUMING / SCHEDULED state while the recovered container state would be QUEUED). The LAUNCH event was sent but did not reach the {{ContainerLaunch}}, in which case the {{ContainerScheduler}} would need to resend those events.
** It is also possible that just before the NM went down, some running containers were in the process of being PAUSED (the container would be in the PAUSING state while the recovered container state would be RUNNING). The kill/pause event was sent but, again, did not reach the executor.
* Both of the above scenarios should be covered by calling the {{ContainerScheduler::startPendingContainers(..)}} method on the ContainerScheduler; it checks whether there are queued opportunistic containers and starts/resumes them. I propose we create another {{ContainerSchedulerEventType}} - just call it RECOVERY_COMPLETED - and dispatch it to the ContainerScheduler at the end of the {{ContainerManager::recover()}} method. In the ContainerScheduler, when we receive the event, just call {{startPendingContainers(..)}} (see the second sketch after this list). Makes sense?
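
To make the first point concrete, here is roughly the shape I have in mind for {{recoverActiveContainer}}. This is only a sketch: the field names ({{queuedGuaranteedContainers}}, {{queuedOpportunisticContainers}}, {{runningContainers}}) and the exact {{RecoveredContainerStatus}} checks are my assumptions based on the current patch, so adjust to whatever you actually have:

{code:java}
private void recoverActiveContainer(Container container,
    RecoveredContainerState rcs) {
  ExecutionType execType =
      container.getContainerTokenIdentifier().getExecutionType();
  if (rcs.getStatus() == RecoveredContainerStatus.QUEUED) {
    if (execType == ExecutionType.GUARANTEED) {
      queuedGuaranteedContainers.put(container.getContainerId(), container);
    } else if (execType == ExecutionType.OPPORTUNISTIC) {
      queuedOpportunisticContainers.put(container.getContainerId(), container);
    }
  } else if (rcs.getStatus() == RecoveredContainerStatus.LAUNCHED ||
      rcs.getStatus() == RecoveredContainerStatus.PAUSED) {
    // The container was running (or paused) before the restart, so its
    // resources must be re-added to the utilization tracker; otherwise
    // the NM would over-allocate after recovery.
    this.utilizationTracker.addContainerResources(container);
    runningContainers.put(container.getContainerId(), container);
  }
}
{code}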
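And for the recovery-completed hook, something like the following (again just a sketch: passing a null container to {{ContainerSchedulerEvent}} is an assumption - we may want a dedicated event subclass instead - and the argument list of {{startPendingContainers(..)}} should follow whatever the final signature is):

{code:java}
// ContainerSchedulerEventType - add the new constant:
public enum ContainerSchedulerEventType {
  SCHEDULE_CONTAINER,
  CONTAINER_COMPLETED,
  // ... existing constants elided ...
  // Proposed: dispatched exactly once, after NM recovery finishes.
  RECOVERY_COMPLETED
}

// At the end of ContainerManager::recover():
dispatcher.getEventHandler().handle(
    new ContainerSchedulerEvent(null,
        ContainerSchedulerEventType.RECOVERY_COMPLETED));

// In ContainerScheduler::handle(..), the new case just re-kicks the
// queue, which also covers the lost LAUNCH and kill/pause events above:
case RECOVERY_COMPLETED:
  startPendingContainers();
  break;
{code}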

Also, with regard to my earlier comment:
bq. in addition to storing the container update token, use the old resource update key and store the changed resource also.
Apologies, but I think we can revert back to how you had it in your earlier patch, because this won't guarantee that rollback works: the old version of the NM will still see the new key and bomb anyway (see the sketch below). So we will just have to document somewhere that if a running container is updated, rollback is not possible until the container completes.
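
For reference, the "bomb" above is because the old NM's leveldb recovery loop treats any container sub-key it does not recognize as corruption - roughly this shape in {{NMLeveldbStateStoreService.loadContainerState(..)}} (from memory, so treat the suffix names as illustrative):

{code:java}
// Per-container key loop in the *old* NM's loadContainerState(..):
String suffix = key.substring(containerPrefix.length());
if (suffix.equals(CONTAINER_REQUEST_KEY_SUFFIX)) {
  // ... known key suffixes are handled here ...
} else {
  // An old NM lands here on the new update-token key written by this
  // patch, so recovery throws and rollback is blocked.
  throw new IOException("Unexpected container state key: " + key);
}
{code}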



> NM Statestore cleanup for Container updates
> -------------------------------------------
>
>                 Key: YARN-7275
>                 URL: https://issues.apache.org/jira/browse/YARN-7275
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: kartheek muthyala
>            Priority: Blocker
>         Attachments: YARN-7275.001.patch, YARN-7275.002.patch, YARN-7275.003.patch, YARN-7275.004.patch
>
>
> Currently, only resource updates are recorded in the NM state store; we need to add ExecutionType updates as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
