hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures
Date Wed, 24 Jan 2018 02:19:00 GMT

     [ https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-7790:
-----------------------------
    Description: 
This is not a new issue, but async scheduling makes it worse:

In sync scheduling, if an AM container is allocated to a node, the scheduler assumes the node
has just heartbeated to the RM, and the AM launcher will connect to the NM to launch the
container. It is still possible that the NM crashes after the heartbeat, which causes the AM
to hang for a while, but this is relatively rare.

In the async scheduling world, multiple AM containers can be placed on a problematic NM, which
can easily cause applications to hang. After discussion with [~sunilg] and [~jianhe], we need
one fix:

When async scheduling is enabled:
 - Skip nodes that have missed X node heartbeats.
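
The skip rule can be sketched roughly as follows. This is only an illustration, not the attached patch: the class name StaleNodeFilter, its fields, and the shouldSkip method are hypothetical, and it assumes the scheduler knows each node's last heartbeat time and the expected heartbeat interval.

```java
/**
 * Hypothetical helper, not the YARN-7790 patch: decides whether the async
 * scheduler should skip a node because its heartbeats look stale.
 */
final class StaleNodeFilter {
  private final long heartbeatIntervalMs;  // expected NM heartbeat period (assumed known)
  private final int maxMissedHeartbeats;   // the "X" from the description

  StaleNodeFilter(long heartbeatIntervalMs, int maxMissedHeartbeats) {
    if (heartbeatIntervalMs <= 0 || maxMissedHeartbeats <= 0) {
      throw new IllegalArgumentException("interval and threshold must be positive");
    }
    this.heartbeatIntervalMs = heartbeatIntervalMs;
    this.maxMissedHeartbeats = maxMissedHeartbeats;
  }

  /** Returns true when the node has been silent for more than X heartbeat intervals. */
  boolean shouldSkip(long lastHeartbeatMs, long nowMs) {
    long silentForMs = nowMs - lastHeartbeatMs;
    return silentForMs > heartbeatIntervalMs * maxMissedHeartbeats;
  }
}
```

With a 1000 ms interval and X = 2, a node is skipped only once it has been silent for more than 2000 ms, so a single late heartbeat does not take the node out of scheduling.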

In addition, it is better to reduce the wait time by tuning the following configs, so that a
container being launched on an NM with connectivity issues fails earlier.
{code:java}
RetryPolicy retryPolicy =
    createRetryPolicy(conf,
      YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
      YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
      YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
      YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
{code}
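
To see why lowering these two values fails a launch earlier, here is a rough model of the retry budget. It assumes the policy retries at a fixed interval until the maximum wait is used up; the class NmConnectRetryBudget and the numbers in the usage note are hypothetical, and real wall-clock time also depends on per-attempt connect timeouts, which this sketch ignores.

```java
/**
 * Hypothetical model of a "retry at a fixed interval up to a maximum wait"
 * policy. It is an assumption about how the two settings interact, not code
 * taken from Hadoop.
 */
final class NmConnectRetryBudget {
  private NmConnectRetryBudget() {}

  /** Number of retries that fit into maxWaitMs when sleeping retryIntervalMs between attempts. */
  static long maxRetries(long maxWaitMs, long retryIntervalMs) {
    if (maxWaitMs < 0 || retryIntervalMs <= 0) {
      throw new IllegalArgumentException("maxWaitMs must be >= 0 and retryIntervalMs > 0");
    }
    return maxWaitMs / retryIntervalMs;
  }
}
```

Under this model, a 180000 ms maximum wait with a 10000 ms interval allows 18 retries, while 15000 ms with a 5000 ms interval allows only 3, so the AM launcher would give up on an unreachable NM much sooner.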

  was:
This is not a new issue but async scheduling makes it worse:

In sync scheduling, if an AM container allocated to a node, it assumes node just heartbeat
to RM, and AM launcher will connect NM to launch the container. Even though it is possible
that NM crashes after the heartbeat, which causes AM hangs for a while. But it is related
rare.

In async scheduling world, multiple AM containers can be placed on a problematic NM, which
could cause application hangs easily. Discussed with [~sunilg] and [~jianhe] , we need one
fix:

When async scheduling enabled:
- Skip node which missed X node heartbeat.

And in addition, it's better to reduce wait time by setting following configs to earlier fail
a container being launched on a NM with connection issue.
{code:java}
RetryPolicy retryPolicy =
    createRetryPolicy(conf,
      YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
      YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS,
      YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
      YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
{code}


> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --------------------------------------------------------------------------
>
>                 Key: YARN-7790
>                 URL: https://issues.apache.org/jira/browse/YARN-7790
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Sumana Sathish
>            Assignee: Wangda Tan
>            Priority: Critical
>         Attachments: YARN-7790.001.patch



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
