hadoop-yarn-issues mailing list archives

From "Arun Suresh (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-4876) [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
Date Tue, 14 Jun 2016 20:56:30 GMT

    [ https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330496#comment-15330496 ]

Arun Suresh edited comment on YARN-4876 at 6/14/16 8:56 PM:
------------------------------------------------------------

Aggregating and posting some design points on the patch based on offline discussions with
[~marco.rabozzi] :

h4. ContainerImpl state machine
In the current patch, containers that are initialized using the new initializeContainers APIs
keep waiting for startContainers requests within the LOCALIZED state after resource localization.
When the START_CONTAINER event is generated upon request from the application master, the
container transitions to a new LAUNCHING state and waits for a CONTAINER_LAUNCHED event (this is
fired asynchronously by ContainerLaunch when the container process is being started). Upon
receiving the CONTAINER_LAUNCHED event, the container state is updated to RUNNING. For containers
that do not allow multi-start (i.e. those that are initialized and started using the standard
startContainers API), the START_CONTAINER event is automatically sent after localization.

The role of the new “LAUNCHING” state is to make a clear distinction between the following
two situations:
# The container has been localized and is waiting for a start request (LOCALIZED state)
# The container has received a start request and it is being started (LAUNCHING state)
In this fashion, we can allow a start (or a restart) of an idle container only if the container
is in the LOCALIZED state and if it allows multi-start. 

From a first analysis, it seems that the new LAUNCHING state and the already present RELAUNCHING
state could be merged into a single LAUNCHING state to reduce the state machine complexity.

The destroyContainers API is equivalent to stopContainers if the specified containers do not
allow multi-start. On the other hand, in case of a container that allows multi-start, the
stopContainers API kills the container process and reverts the container state machine to
“LOCALIZED”. However, in order to properly catch the termination of a container process
for which a stop request has been issued, an additional “STOPPING” state has been inserted.
If the container is in RUNNING state and it allows multi-start, the application master can
issue a stopContainers request upon which the container state is updated to STOPPING and an
asynchronous request to kill the container process is sent. Within the STOPPING state, similarly
to the KILLING state, the container termination events (CONTAINER_EXITED_WITH_SUCCESS, CONTAINER_KILLED_ON_REQUEST,
CONTAINER_EXITED_WITH_FAILURE) are all treated as a successful container stop, upon which the
container state reverts to LOCALIZED.
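
The transitions described in this section can be sketched as a small table-free state machine.
This is only an illustrative model, not the ContainerImpl code from the patch: the class name,
the STOP_CONTAINER event name and the collapsing of a non-multi-start stop into a single DONE
state are assumptions made for the example.

```java
// Hypothetical, simplified model of the transitions described in this comment.
// It is not the real ContainerImpl state machine and omits most states and events.
final class ContainerLifecycleSketch {
    enum State { LOCALIZED, LAUNCHING, RUNNING, STOPPING, DONE }

    enum Event {
        START_CONTAINER, CONTAINER_LAUNCHED, STOP_CONTAINER, // STOP_CONTAINER is an assumed name
        CONTAINER_EXITED_WITH_SUCCESS, CONTAINER_KILLED_ON_REQUEST,
        CONTAINER_EXITED_WITH_FAILURE
    }

    private final boolean multiStart;
    private State state = State.LOCALIZED;

    ContainerLifecycleSketch(boolean multiStart) {
        this.multiStart = multiStart;
    }

    State state() {
        return state;
    }

    /** Applies one event and returns the new state; rejects transitions not modeled here. */
    State handle(Event event) {
        State next = null;
        switch (state) {
            case LOCALIZED:
                // Idle after localization: only a start request moves the container on.
                if (event == Event.START_CONTAINER) next = State.LAUNCHING;
                break;
            case LAUNCHING:
                if (event == Event.CONTAINER_LAUNCHED) next = State.RUNNING;
                break;
            case RUNNING:
                if (event == Event.STOP_CONTAINER) {
                    // Assumption: without multi-start a stop is final, collapsed here into DONE.
                    next = multiStart ? State.STOPPING : State.DONE;
                }
                break;
            case STOPPING:
                // Any termination event counts as a successful stop.
                if (event == Event.CONTAINER_EXITED_WITH_SUCCESS
                        || event == Event.CONTAINER_KILLED_ON_REQUEST
                        || event == Event.CONTAINER_EXITED_WITH_FAILURE) {
                    next = State.LOCALIZED;
                }
                break;
            default:
                break;
        }
        if (next == null) {
            throw new IllegalStateException(event + " is not allowed in state " + state);
        }
        state = next;
        return state;
    }
}
```

With multi-start enabled, a stopped container ends up back in LOCALIZED and accepts another
START_CONTAINER; without it, the sketch rejects any event after the stop.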

h4. Working directory cleanup
When a container is in the LOCALIZED state and multi-start is enabled, the application master
can issue the following 3 new types of requests:
# StartContainers (ContainerLaunchContext == NULL)
# InitializeContainers
# StartContainers (ContainerLaunchContext != NULL)

In case 1) the container is simply started using the ContainerLaunchContext issued in the
previous InitializeContainers request (the state machine transitions for this case are the
ones described in the previous section). Cases 2) and 3) both perform reinitialization and
relocalization of container resources; the only difference between them is that in 3)
the container is also started after relocalization. Currently, when the container is reinitialized,
the container working directory is deleted to ensure a clean state for the subsequent container
starts. We could relax this behavior and allow the application master to specify
a deletion policy for container reinitialization. Depending on the requirements, we might want
to address this aspect here or in a follow-up JIRA.
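
The three cases above can be summarized as a small decision function. This is a sketch only:
the class, enum and method names are hypothetical, and it assumes that an InitializeContainers
request always carries a new ContainerLaunchContext.

```java
// Hypothetical classification of requests received while a multi-start
// container is idle in the LOCALIZED state. Names are illustrative only.
final class ReinitRequestSketch {
    enum Action { START_WITH_PREVIOUS_CONTEXT, REINITIALIZE_ONLY, REINITIALIZE_AND_START }

    /**
     * @param isStartRequest   true for StartContainers, false for InitializeContainers
     * @param hasLaunchContext whether the request carries a ContainerLaunchContext
     */
    static Action classify(boolean isStartRequest, boolean hasLaunchContext) {
        if (isStartRequest) {
            // Case 3 when a new context is supplied, case 1 otherwise.
            return hasLaunchContext
                    ? Action.REINITIALIZE_AND_START
                    : Action.START_WITH_PREVIOUS_CONTEXT;
        }
        // Case 2: assumed to always bring a new context.
        return Action.REINITIALIZE_ONLY;
    }

    /** Current behavior: every reinitialization deletes the working directory. */
    static boolean deletesWorkingDirectory(Action action) {
        return action != Action.START_WITH_PREVIOUS_CONTEXT;
    }
}
```

A configurable deletion policy, if we add one, would replace the fixed rule in
deletesWorkingDirectory.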

h4. Log handling
Currently, there is no special handling of logs for a restarted container. The application
master can decide either to append the new logs to the old ones or overwrite the old logs.
This can be achieved simply by changing the launch command (e.g. on Linux, use “>>”
to append and “>” to overwrite).
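
As a minimal illustration of the point above, an application master could build its launch
command with a helper along these lines. The helper and its names are hypothetical and not
part of the patch; it only concatenates the shell redirection onto the command.

```java
// Hypothetical helper showing how the launch command selects append vs. overwrite.
final class LogRedirectSketch {
    /** Returns the command with stdout redirected to logFile, appending if requested. */
    static String withStdoutRedirect(String command, String logFile, boolean append) {
        String operator = append ? ">>" : ">";
        return command + " " + operator + " " + logFile;
    }
}
```

For example, withStdoutRedirect("./run.sh", "stdout.log", true) yields a command that keeps
the logs of earlier starts, while passing false truncates the file on every restart.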

h4. Token expiration
Both the InitializeContainers and the StartContainers APIs require a container token to authorize
the request. For long-running containers, the token might expire, and the application master
would then be unable to request a restart or a reinitialization of a container. This limitation
also currently applies to the IncreaseContainerResource API. We might need to address container
token renewal in a separate JIRA.

h4. Recovery for containers that allow multi-start
The current patch does not fully support recovery of containers that allow multi-start. Indeed,
after a restart of the NodeManager, if the container is not running, the NodeManager cannot
distinguish between a stopped container waiting for a start request and a container that completed
its execution successfully. Additional information in the state store might be needed to handle
this case.
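
One possible shape for that additional information is a per-container marker written when the
container stops or completes. The sketch below uses an in-memory map as a stand-in for the
NodeManager state store; all class, enum and method names are hypothetical and do not
correspond to the existing state store API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: persist why a container is not running so that
// recovery can tell "stopped, waiting for start" from "completed".
final class RecoverySketch {
    enum Persisted { AWAITING_START, COMPLETED }
    enum Recovered { LOCALIZED, DONE, UNKNOWN }

    // Stand-in for the NodeManager state store (assumption: a simple key/value view).
    private final Map<String, Persisted> store = new HashMap<>();

    void recordStopped(String containerId) {
        store.put(containerId, Persisted.AWAITING_START);
    }

    void recordCompleted(String containerId) {
        store.put(containerId, Persisted.COMPLETED);
    }

    /** Decides which state a non-running container should be recovered into. */
    Recovered recover(String containerId) {
        Persisted marker = store.get(containerId);
        if (marker == null) {
            return Recovered.UNKNOWN; // today's situation: nothing recorded
        }
        return marker == Persisted.AWAITING_START ? Recovered.LOCALIZED : Recovered.DONE;
    }
}
```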

h4. Auxiliary Service Data
In the current YARN implementation, a CONTAINER_INIT and an APPLICATION_INIT event are sent
to the auxiliary services every time a new container is initialized. With the new initializeContainers
API, it is possible to reinitialize a container multiple times even without actually starting
it. The current implementation of the patch sends a CONTAINER_INIT and an APPLICATION_INIT
event for every reinitialization of a container (potentially sending new data to the auxiliary
services). We should verify whether this behavior is correct or needs to be modified.

h4. Container failures handling
In the current patch implementation, if a container fails during a reinitialization, the container
is destroyed. On the other hand, if the container fails within the STOPPING state, this is
considered a successful stop. Should we allow the application master to specify a failure-handling
policy for stopping and reinitializing?

h4. 'Container Destroy' monitor
The proposed patch allows the application master to specify a destroyDelay, a timeout after
which an idle container that has not been started is destroyed automatically. The destroy
logic is not yet implemented in the current patch. We might need to implement a “destroy
containers monitor” service that checks for containers to destroy at a configurable time
interval.
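
A minimal sketch of what such a monitor could track is shown below. It is not the design from
the patch: the class and method names are hypothetical, and the current time is passed in
explicitly so that the sweep could be driven by any periodic scheduler (for example a
java.util.concurrent.ScheduledExecutorService) at the configured interval.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical "destroy containers monitor": remembers when each container
// became idle and reports those whose destroyDelay has elapsed.
final class DestroyMonitorSketch {
    // containerId -> { idleSinceMillis, destroyDelayMillis }
    private final Map<String, long[]> idle = new LinkedHashMap<>();

    void markIdle(String containerId, long nowMillis, long destroyDelayMillis) {
        idle.put(containerId, new long[] { nowMillis, destroyDelayMillis });
    }

    /** A started container is no longer a candidate for automatic destruction. */
    void markStarted(String containerId) {
        idle.remove(containerId);
    }

    /** Returns and forgets the containers whose delay has expired at nowMillis. */
    List<String> sweep(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, long[]>> it = idle.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, long[]> entry = it.next();
            long idleSince = entry.getValue()[0];
            long delay = entry.getValue()[1];
            if (nowMillis - idleSince >= delay) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

The caller would issue the actual destroy for each id returned by sweep; thread safety is
left out of the sketch.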

h4. Uploaded resource
During container relocalization, do we need specific logic for resources that are uploaded
to the shared cache? Currently, before localizing the new resources, the old container local
resources are released. Do we also have to clear the resourcesUploadPolicies map of ContainerImpl
during relocalization?





> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
> ------------------------------------------------------------------
>
>                 Key: YARN-4876
>                 URL: https://issues.apache.org/jira/browse/YARN-4876
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Marco Rabozzi
>         Attachments: YARN-4876-design-doc.pdf, YARN-4876.002.patch, YARN-4876.01.patch
>
>
> Introduce *initialize* and *destroy* container API into the *ContainerManagementProtocol*
and decouple the actual start of a container from the initialization. This will allow AMs
to re-start a container without having to lose the allocation.
> Additionally, if the localization of the container is associated with the initialize (and
the cleanup with the destroy), this can also be used by applications to upgrade a container
by *re-initializing* it with a new *ContainerLaunchContext*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


