ambari-issues mailing list archives

From "Weiwei Yang (JIRA)" <>
Subject [jira] [Updated] (AMBARI-18929) Yarn service check fails when either resource manager is down in HA enabled cluster
Date Fri, 25 Nov 2016 04:03:59 GMT


Weiwei Yang updated AMBARI-18929:
    Attachment: AMBARI-18929_trunk.patch

Hi [~Tim Thorpe], [~dili]

I have attached a patch to fix this. With this patch, the YARN service check first queries the REST API {{http://<rm_host>:<port>/ws/v1/cluster/info}}
to determine the address of the active RM. This API is available since Hadoop 2.3, the very first
version to support HA, and it is served by both the active and standby RMs, as well as
by the single RM in a non-HA environment, without redirection. Once the active RM is identified, the rest of the logic remains
the same. If no active RM is found, the service check fails, either because the HTTP service cannot be reached
on either RM or because both RMs are in standby state.
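
The lookup described above can be sketched roughly as follows. This is only an illustration, not the code in the attached patch: the function names {{fetch_cluster_info}} and {{find_active_rm}} are hypothetical, and the sketch assumes that the response carries a {{clusterInfo.haState}} field and that a non-HA RM reports {{ACTIVE}} there.

```python
import json
from urllib.request import urlopen


def fetch_cluster_info(address, timeout=5):
    """Fetch and decode /ws/v1/cluster/info from one RM web address (host:port)."""
    url = "http://%s/ws/v1/cluster/info" % address
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def find_active_rm(rm_addresses, fetch=fetch_cluster_info):
    """Return the first address whose RM reports haState ACTIVE, else None.

    `fetch` is injectable so the lookup can be exercised without a cluster.
    """
    for address in rm_addresses:
        try:
            info = fetch(address)
        except (OSError, ValueError):
            # RM unreachable, or the response was not valid JSON: try the next RM.
            continue
        # Assumption: the HA state is exposed as clusterInfo.haState.
        if info.get("clusterInfo", {}).get("haState") == "ACTIVE":
            return address
    # No RM reachable, or every reachable RM is in standby.
    return None
```

A caller would fail the service check when {{find_active_rm}} returns {{None}}, and otherwise run the remaining checks against the returned address.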

I tested this patch in the following scenarios:

HA environment
# Both active & standby RMs are up : SUCCESS
# Shut down the standby RM, active remains up : SUCCESS
# Shut down the active RM, the other RM transitions to active : SUCCESS
# Shut down ZooKeeper, both RMs are standby : FAIL
# Both RMs are down : FAIL

Non-HA environment
# RM is up : SUCCESS
# RM is down : FAIL

Please help review the patch.

> Yarn service check fails when either resource manager is down in HA enabled cluster
> -----------------------------------------------------------------------------------
>                 Key: AMBARI-18929
>                 URL:
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Weiwei Yang
>         Attachments: AMBARI-18929_trunk.patch
> When HA is enabled, the YARN service check fails if one of the RMs is down, even if the other
one is active. This gives the user the wrong impression that the YARN cluster is not healthy. Instead,
the service check should pass, or at least pass with a warning that lets the user know that one
RM is down.

This message was sent by Atlassian JIRA
