hadoop-yarn-issues mailing list archives

From "Xianyin Xin (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and NN go down at the same time.
Date Tue, 27 Oct 2015 04:17:27 GMT

     [ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xianyin Xin resolved YARN-3639.
-------------------------------
    Resolution: Fixed

This has been resolved by YARN-4041, so I am closing this issue.

> It takes too long time for RM to recover all apps if the original active RM and NN go down at the same time.
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3639
>                 URL: https://issues.apache.org/jira/browse/YARN-3639
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Xianyin Xin
>         Attachments: YARN-3639-recovery_log_1_app.txt
>
>
> If the active RM and NN go down at the same time, the new RM takes a long time to recover
> all apps. After analysis, we found that the root cause is the renewal of HDFS tokens during
> recovery. The HDFS client created by the renewer first tries to connect to the original NN,
> which times out after 10 to 20 seconds, and only then tries to connect to the new NN. The
> entire recovery cost 15 * #apps seconds according to our test.
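
To make the 15 * #apps figure above concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from the issue or from Hadoop code. It assumes that token renewals run one after another during recovery and that each renewal first waits out a timeout against the unreachable NN; the function name, the 15-second default (the midpoint of the reported 10 to 20 seconds), and the `renew_s` parameter are hypothetical and only for illustration.

```python
def estimated_recovery_seconds(num_apps, dead_nn_timeout_s=15.0, renew_s=0.0):
    """Estimate total RM recovery time under a simple serial model.

    Assumptions (not from Hadoop itself):
      - each app's HDFS token renewal happens sequentially;
      - every renewal first waits dead_nn_timeout_s for the connection
        attempt to the original (down) NN to time out;
      - renew_s is the time the renewal takes against the new NN.
    """
    if num_apps < 0:
        raise ValueError("num_apps must be non-negative")
    return num_apps * (dead_nn_timeout_s + renew_s)


if __name__ == "__main__":
    # Show how the per-app timeout adds up as the number of apps grows.
    for n in (10, 100, 1000):
        total = estimated_recovery_seconds(n)
        print(f"{n} apps: {total:.0f} s (about {total / 60:.1f} min)")
```

Under these assumptions the cost grows linearly with the number of apps, so a cluster with many running apps would spend most of its recovery time waiting on timeouts rather than doing useful work.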



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
