hadoop-yarn-issues mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3094) reset timer for liveness monitors after RM recovery
Date Sat, 24 Jan 2015 02:17:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290377#comment-14290377 ]

Jun Gong commented on YARN-3094:
--------------------------------

[~rohithsharma] Thanks for your review. I will add a test case if needed.

{quote}
How many RUNNING applications are running in cluster?
{quote}
Only a few hundred apps are running. The slow recovery may be caused by the many
exceptions thrown while storing RMApps' data with RMApplicationHistoryWriter. We will
investigate further.

{quote}
What is the AM liveliness timeout configured in cluster?
{quote}
3 minutes. With a short timeout we can detect an AM crash sooner.

> reset timer for liveness monitors after RM recovery
> ---------------------------------------------------
>
>                 Key: YARN-3094
>                 URL: https://issues.apache.org/jira/browse/YARN-3094
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3094.patch
>
>
> When the RM restarts, it recovers RMAppAttempts and registers them with AMLivenessMonitor if they are not in a final state. AMs will time out in the RM if the recovery process takes a long time for some reason (e.g. too many apps).
> In our system, the recovery process took about 3 minutes, and all AMs timed out.
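
The timeout described above can be illustrated with a minimal sketch. This is not the attached patch and not the real AMLivenessMonitor: the class LivenessMonitorSketch, its methods, and the explicit nowMs clock parameter are all hypothetical, chosen only to show why a timer that starts at registration can expire before recovery finishes, and how restarting the timers after recovery avoids that.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for a liveness monitor (not YARN code). */
final class LivenessMonitorSketch {
    private final long expiryIntervalMs;
    // Attempt id -> time of the last ping (or of registration / reset).
    private final Map<String, Long> lastPing = new HashMap<>();

    LivenessMonitorSketch(long expiryIntervalMs) {
        this.expiryIntervalMs = expiryIntervalMs;
    }

    /** Start tracking an attempt; its timer starts at nowMs. */
    void register(String attemptId, long nowMs) {
        lastPing.put(attemptId, nowMs);
    }

    /** Restart every tracked timer, for example once recovery has finished. */
    void resetTimers(long nowMs) {
        lastPing.replaceAll((id, ts) -> nowMs);
    }

    /** Return the attempts whose last ping is older than the expiry interval. */
    List<String> expired(long nowMs) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastPing.entrySet()) {
            if (nowMs - e.getValue() > expiryIntervalMs) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With an assumed expiry interval of 180000 ms and an attempt registered at time 0, a recovery that lasts 200000 ms leaves the attempt expired before its AM has had a chance to heartbeat. Calling resetTimers(200000) at the end of recovery gives the AM a full interval again.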



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
