hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4725) [Umbrella] Auto-restart of containers
Date Tue, 08 Mar 2016 11:46:40 GMT

    [ https://issues.apache.org/jira/browse/YARN-4725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184821#comment-15184821 ]

Junping Du commented on YARN-4725:

For the same requirement, could we implement this another way: add stickiness to the
first-attempt NM for service containers in RM scheduling, rather than relaunching at the NM?
Also, from another angle, service containers may need a higher bar for NM quality, which
may require blacklisting NMs that caused previous failures in follow-up runs.

> [Umbrella] Auto-restart of containers
> --------------------------------------
>                 Key: YARN-4725
>                 URL: https://issues.apache.org/jira/browse/YARN-4725
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
> See overview doc at YARN-4692, copying the sub-section to track all related efforts.
> Today, when a container (process-tree) dies, NodeManager assumes that the container’s
allocation is also expired, and reports accordingly to the ResourceManager which then releases
the allocation. For service containers, this is undesirable in many cases. Long running containers
may exit for various reasons, crash and need to restart but forcing them to go through the
complete scheduling cycle, resource localization etc. is both unnecessary and expensive. (Task)
For services it will be good to have NodeManagers automatically restart containers. This
looks a lot like inittab / daemontools at the system level.
> We will need to enable app-specific policies (very similar to the handling of AM restarts
at YARN level) for restarting containers automatically but limit such restarts if a container
dies too often in a short interval of time.
> YARN-3998 is an existing ticket that looks at some if not all of this functionality.
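
The restart limit described in the issue ("limit such restarts if a container dies too often in a short interval of time") could be sketched as a sliding-window counter. This is only an illustration under stated assumptions: RestartLimiter, maxRestarts, windowMillis and tryRestart are hypothetical names, not part of any YARN API, and the issue does not specify how the policy would actually be implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a per-container restart policy (not a YARN class):
 * allow at most maxRestarts restarts within any windowMillis interval.
 */
final class RestartLimiter {
    private final int maxRestarts;
    private final long windowMillis;
    // Timestamps of restarts that are still inside the window, oldest first.
    private final Deque<Long> restartTimes = new ArrayDeque<>();

    RestartLimiter(int maxRestarts, long windowMillis) {
        this.maxRestarts = maxRestarts;
        this.windowMillis = windowMillis;
    }

    /**
     * Called when the container exits at time nowMillis. Returns true and
     * records the restart if the policy allows it; returns false if the
     * container has already been restarted too often within the window.
     */
    boolean tryRestart(long nowMillis) {
        // Forget restarts that have aged out of the window.
        while (!restartTimes.isEmpty()
                && nowMillis - restartTimes.peekFirst() >= windowMillis) {
            restartTimes.pollFirst();
        }
        if (restartTimes.size() >= maxRestarts) {
            return false; // dying too often: give up and report the failure
        }
        restartTimes.addLast(nowMillis);
        return true;
    }
}
```

With new RestartLimiter(2, 1000), exits at t=0 and t=100 are restarted, an exit at t=200 is refused, and an exit at t=1000 is restarted again because the t=0 restart has left the window. When tryRestart returns false, the container would presumably fall back to today's behaviour and be reported to the ResourceManager as completed.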

This message was sent by Atlassian JIRA
