ambari-issues mailing list archives

From "Di Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMBARI-18929) Yarn service check fails when either resource manager is down in HA enabled cluster
Date Wed, 23 Nov 2016 13:41:58 GMT

    [ https://issues.apache.org/jira/browse/AMBARI-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15690127#comment-15690127
] 

Di Li commented on AMBARI-18929:
--------------------------------

[~cheersyang] In principle, the two service checks seem to fail for the same reason: both
fail prematurely when the logic hits a dead node.

In implementation they differ. YARN has the following logic (also with a very short
timeout, in my opinion). As you can see, it assumes both RMs are online; it should also check
the curl exit code for better error handling.

for rm_webapp_address in params.rm_webapp_addresses_list:
  info_app_url = params.scheme + "://" + rm_webapp_address + "/ws/v1/cluster/apps/" + application_name

  get_app_info_cmd = "curl --negotiate -u : -ks --location-trusted --connect-timeout " + CURL_CONNECTION_TIMEOUT + " " + info_app_url

  return_code, stdout, _ = get_user_call_output(get_app_info_cmd,
                                                user=params.smokeuser,
                                                path='/usr/sbin:/sbin:/usr/local/bin:/bin:/usr/bin',
                                                )

  # Handle HDP<2.2.8.1 where RM doesn't do automatic redirection from standby to active
  if stdout.startswith("This is standby RM. Redirecting to the current active RM:"):
    Logger.info(format("Skipped checking of {rm_webapp_address} since returned '{stdout}'"))
    continue
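
The error handling suggested above could look roughly like the sketch below. It is standalone Python 3, not the actual Ambari script: run_curl and check_resource_managers are hypothetical names, the 5-second timeout is an assumed value, and plain subprocess stands in for get_user_call_output. The idea is to record an RM as down when curl exits non-zero, keep going, and fail only if no RM returned application info.

```python
import subprocess

CURL_CONNECTION_TIMEOUT = "5"  # seconds; assumed value, not taken from Ambari


def run_curl(url):
    """Hypothetical stand-in for get_user_call_output: returns (exit_code, stdout)."""
    cmd = ["curl", "--negotiate", "-u", ":", "-ks", "--location-trusted",
           "--connect-timeout", CURL_CONNECTION_TIMEOUT, url]
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    return proc.returncode, proc.stdout


def check_resource_managers(addresses, app_path, scheme="http", runner=run_curl):
    """Query every RM and return (responses, down).

    responses maps each answering, non-standby RM address to its output;
    down lists the addresses for which curl exited non-zero.
    """
    responses, down = {}, []
    for address in addresses:
        code, stdout = runner(scheme + "://" + address + app_path)
        if code != 0:
            # curl itself failed (e.g. 7 = failed to connect, 28 = timed out):
            # note the dead node and continue with the remaining RMs.
            down.append(address)
            continue
        if stdout.startswith("This is standby RM."):
            # Standby RM that does not redirect; nothing to validate here.
            continue
        responses[address] = stdout
    if not responses:
        raise RuntimeError("No ResourceManager returned application info; "
                           "unreachable: " + ", ".join(down))
    return responses, down
```

A caller could then log the contents of down as a warning and validate the application state from responses, so that one dead RM no longer fails the whole check.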

For HDFS, it's a two-path approach. I haven't run it, but I suspect it is the second part,
the checkWebUI.py logic, that fails? If so, the suggestion is the same: better error handling,
so that the check continues until all hosts have been pinged.

> Yarn service check fails when either resource manager is down in HA enabled cluster
> -----------------------------------------------------------------------------------
>
>                 Key: AMBARI-18929
>                 URL: https://issues.apache.org/jira/browse/AMBARI-18929
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Weiwei Yang
>
> When HA is enabled, the YARN service_check.py fails if one of the RMs is down, even if the other
> one is active. This gives the user the wrong impression that the YARN cluster is not healthy. Instead,
> the service check should pass, or at least pass with a warning that lets the user know that one
> RM is down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
