hadoop-yarn-issues mailing list archives

From "Xianyin Xin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.
Date Thu, 14 May 2015 01:36:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543053#comment-14543053
] 

Xianyin Xin commented on YARN-3639:
-----------------------------------

Maybe we can fix this problem in the following way: later apps should learn from
earlier ones, i.e., if one app finds that the original NN cannot be reached and then
connects to the new NN successfully, the apps recovered after it should be aware of this
so they do not repeat the failure. The token renewer creates an HDFS client when it tries
to renew an HDFS token for an app; maybe the following apps could reuse that client?
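
To make the idea concrete, here is a minimal sketch of "remember which NN worked and
try it first". It is not the actual DelegationTokenRenewer or HDFS client code: the class
name ActiveNameNodeHint, the connect method and the tryConnect probe are all hypothetical
and stand in for whatever really opens the connection to a namenode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

/**
 * Hypothetical helper (not part of Hadoop): remembers which namenode last
 * accepted a connection so that later token renewals try it first.
 */
final class ActiveNameNodeHint {
    // Candidate namenode addresses in their configured order (assumed input).
    private final List<String> candidates;
    // Address that most recently accepted a connection, or null if unknown.
    private final AtomicReference<String> lastGood = new AtomicReference<>();

    ActiveNameNodeHint(List<String> candidates) {
        this.candidates = new ArrayList<>(candidates);
    }

    /**
     * Tries the remembered namenode first, then the remaining candidates in
     * configured order. tryConnect stands in for the real, possibly slow,
     * connection attempt. Returns the address that worked, or null if none did.
     */
    String connect(Predicate<String> tryConnect) {
        String hint = lastGood.get();
        if (hint != null && tryConnect.test(hint)) {
            return hint; // fast path: no time-out on the dead namenode
        }
        for (String nn : candidates) {
            if (nn.equals(hint)) {
                continue; // already tried above
            }
            if (tryConnect.test(nn)) {
                lastGood.set(nn); // later apps will start from this one
                return nn;
            }
        }
        lastGood.set(null); // nothing reachable; forget the stale hint
        return null;
    }
}
```

If one instance were shared by all renewals during recovery, only the first app would pay
the time-out on the dead NN; the following apps would go straight to the NN that answered
last time. Reusing a single HDFS client, as suggested above, would have a similar effect.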

> It takes too long time for RM to recover all apps if the original active RM and namenode
is deployed on the same node.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3639
>                 URL: https://issues.apache.org/jira/browse/YARN-3639
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Xianyin Xin
>         Attachments: YARN-3639-recovery_log_1_app.txt
>
>
> If the node on which the active RM runs dies, and the active namenode is running on
the same node, the new RM takes a long time to recover all apps. After analysis, we found
that the root cause is the renewal of HDFS tokens during recovery. The HDFS client created
by the renewer first tries to connect to the original namenode, which times out after
10~20s, and only then tries to connect to the new namenode. The entire
recovery cost 15*#apps seconds according to our test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
