hadoop-yarn-issues mailing list archives

From "Xianyin Xin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4414) Nodemanager connection errors are retried at multiple levels
Date Fri, 08 Jan 2016 07:02:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088804#comment-15088804 ]

Xianyin Xin commented on YARN-4414:

Hi [~lichangleo], do we also need to revisit the two-layer retries in {{RMProxy}}? IIUC, the proxy
layer retries for up to 15 min with a retry interval of 30 sec, and behind the scenes the RM proxy
calculates a maximum retry count from those two values. Each IPC-layer retry takes more than
1 sec, and the IPC layer retries 10 times by default, so the actual total wait time is
15 min + (15 / 0.5) * 10 * (more than 1 sec), which is much more than 15 min.
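
To make the arithmetic above concrete, here is a rough sketch of the estimate. It is a simplified model, not how {{RMProxy}} actually computes anything: the function name and parameter names are hypothetical, the default values are the ones quoted in this comment, and each IPC-layer retry is assumed to cost exactly 1 sec (the comment says "more than 1 sec", so the result is a lower bound).

```python
def worst_case_wait_seconds(max_wait_s=15 * 60,
                            proxy_interval_s=30,
                            ipc_retries=10,
                            ipc_retry_cost_s=1.0):
    """Estimate total wait when proxy-level and IPC-level retries stack.

    Hypothetical helper for illustration only; names and the cost model
    are assumptions, not Hadoop APIs.
    """
    # The proxy layer derives a retry count from the max wait and the interval.
    proxy_attempts = max_wait_s // proxy_interval_s
    # Time spent sleeping between proxy-level attempts.
    proxy_sleep_s = proxy_attempts * proxy_interval_s
    # Each proxy-level attempt also runs the full set of IPC-level retries.
    ipc_time_s = proxy_attempts * ipc_retries * ipc_retry_cost_s
    return proxy_sleep_s + ipc_time_s


if __name__ == "__main__":
    total = worst_case_wait_seconds()
    print(f"estimated wait: {total / 60:.1f} min")
```

With the values quoted here, the estimate comes to 1200 sec (20 min) rather than the intended 15 min, and it grows further if each IPC retry takes longer than 1 sec.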

> Nodemanager connection errors are retried at multiple levels
> ------------------------------------------------------------
>                 Key: YARN-4414
>                 URL: https://issues.apache.org/jira/browse/YARN-4414
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.1, 2.6.2
>            Reporter: Jason Lowe
>            Assignee: Chang Li
>         Attachments: YARN-4414.1.2.patch, YARN-4414.1.2.patch, YARN-4414.1.3.patch, YARN-4414.1.patch,
> This is related to YARN-3238.  I ran into more scenarios where connection errors, such as
> NoRouteToHostException, are being retried at multiple levels.  The fix for YARN-3238 was too
> specific, and I think we need a more general solution that catches a wider array of connection
> errors, so that we avoid retrying them both at the RPC layer and at the NM proxy layer.
