ambari-dev mailing list archives

From "Andrew Onischuk" <aonis...@hortonworks.com>
Subject Re: Review Request 34859: Ambari agents stop heartbeating after days of uptime
Date Sun, 31 May 2015 16:55:11 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34859/#review85909
-----------------------------------------------------------

Ship it!


Ship It!

- Andrew Onischuk


On May 31, 2015, 4:47 p.m., Robert Levas wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34859/
> -----------------------------------------------------------
> 
> (Updated May 31, 2015, 4:47 p.m.)
> 
> 
> Review request for Ambari, Andrew Onischuk, Emil Anca, and Jonathan Hurley.
> 
> 
> Bugs: AMBARI-11570
>     https://issues.apache.org/jira/browse/AMBARI-11570
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> If a cluster has been up for several days, Ambari complains that one or more of the agents
> have stopped heartbeating. This has been observed on Kerberized clusters, but may also occur
> on non-Kerberized clusters (not tested).
> 
> Looking at the ambari-agent log, it appears there may be an _open file_ issue:
> 
> */var/log/ambari-agent/ambari-agent.log*
> ```
> INFO 2015-05-28 11:43:13,547 Controller.py:244 - Heartbeat response received (id = 5382)
> INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for service ZOOKEEPER of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for service AMBARI_METRICS of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for service HDFS of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for service YARN of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for service HDFS of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for service MAPREDUCE2 of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for service YARN of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for service TEZ of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for service HIVE of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:13,548 ActionQueue.py:99 - Adding STATUS_COMMAND for service HIVE of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:13,549 ActionQueue.py:99 - Adding STATUS_COMMAND for service PIG of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:13,549 ActionQueue.py:99 - Adding STATUS_COMMAND for service ZOOKEEPER of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:13,549 ActionQueue.py:99 - Adding STATUS_COMMAND for service AMBARI_METRICS of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:13,549 ActionQueue.py:99 - Adding STATUS_COMMAND for service KERBEROS of cluster BUG36704 to the queue.
> INFO 2015-05-28 11:43:23,549 Heartbeat.py:78 - Building Heartbeat: {responseId = 5382, timestamp = 1432813403549, commandsInProgress = False, componentsMapped = True}
> ERROR 2015-05-28 11:43:23,553 Controller.py:330 - Connection to levas-36704-1.c.pramod-thangali.internal was lost (details=[Errno 24] Too many open files: '/sys/kernel/mm/redhat_transparent_hugepage/enabled')
> INFO 2015-05-28 11:43:34,555 NetUtil.py:59 - Connecting to https://levas-36704-1.c.pramod-thangali.internal:8440/connection_info
> INFO 2015-05-28 11:43:34,627 security.py:93 - SSL Connect being called.. connecting to the server
> INFO 2015-05-28 11:43:34,696 security.py:55 - SSL connection established. Two-way SSL authentication is turned off on the server.
> INFO 2015-05-28 11:43:34,897 Controller.py:244 - Heartbeat response received (id = 5382)
> ERROR 2015-05-28 11:43:34,897 Controller.py:262 - Error in responseId sequence - restarting
> WARNING 2015-05-28 11:43:42,860 base_alert.py:140 - [Alert][yarn_nodemanager_health] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:43:42,873 base_alert.py:140 - [Alert][ams_metrics_monitor_process] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:44:00,537 base_alert.py:140 - [Alert][ambari_agent_disk_usage] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:44:42,860 base_alert.py:140 - [Alert][yarn_nodemanager_health] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:44:42,880 base_alert.py:140 - [Alert][ams_metrics_monitor_process] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:44:43,002 base_alert.py:140 - [Alert][datanode_storage] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:45:00,549 base_alert.py:140 - [Alert][ambari_agent_disk_usage] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:45:42,860 base_alert.py:140 - [Alert][yarn_nodemanager_health] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:45:42,873 base_alert.py:140 - [Alert][ams_metrics_monitor_process] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:46:00,537 base_alert.py:140 - [Alert][ambari_agent_disk_usage] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:46:42,861 base_alert.py:140 - [Alert][yarn_nodemanager_health] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:46:42,863 base_alert.py:140 - [Alert][ams_metrics_collector_hbase_master_cpu] Unable to execute alert. [Alert][ams_metrics_collector_hbase_master_cpu] Unable to get json from jmx response!
> WARNING 2015-05-28 11:46:42,892 base_alert.py:140 - [Alert][ams_metrics_monitor_process] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:46:42,899 base_alert.py:140 - [Alert][datanode_storage] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:47:00,539 base_alert.py:140 - [Alert][ambari_agent_disk_usage] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:47:42,861 base_alert.py:140 - [Alert][yarn_nodemanager_health] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:47:42,873 base_alert.py:140 - [Alert][ams_metrics_monitor_process] Unable to execute alert. [Errno 24] Too many open files
> WARNING 2015-05-28 11:48:00,541 base_alert.py:140 - [Alert][ambari_agent_disk_usage] Unable to execute alert. [Errno 24] Too many open files
> ...
> ```
> 
> Restarting the ambari-agent reconnects it to the server and the cluster becomes healthy again.
> 
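A note on the failure mode in the log: [Errno 24] means the agent process has exhausted its file-descriptor limit, so even an ordinary read such as the one of /sys/kernel/mm/redhat_transparent_hugepage/enabled fails; the fix is to make sure every handle opened by security_commons.py is closed again. Below is a minimal sketch of the general leak pattern and its context-manager fix (the function names are hypothetical and this is not the actual code in this diff):

```
# Hypothetical sketch only -- not the security_commons.py code in this review.
THP_PATH = '/sys/kernel/mm/redhat_transparent_hugepage/enabled'  # path seen in the error above


def read_thp_mode_leaky(path=THP_PATH):
    # close() is never called explicitly; releasing the descriptor is left to
    # garbage collection. Repeated on every heartbeat/alert cycle of a
    # long-running daemon, this kind of pattern lets descriptors accumulate
    # until the process hits its fd limit and open() raises [Errno 24].
    handle = open(path)
    return handle.read().strip()


def read_thp_mode(path=THP_PATH):
    # The 'with' statement closes the descriptor on success and on error alike.
    with open(path) as handle:
        return handle.read().strip()
```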
> 
> Diffs
> -----
> 
>   ambari-agent/src/test/python/resource_management/TestSecurityCommons.py ead0351 
>   ambari-common/src/main/python/resource_management/libraries/functions/security_commons.py 688eba7 
> 
> Diff: https://reviews.apache.org/r/34859/diff/
> 
> 
> Testing
> -------
> 
> Manually tested and viewed `lsof` output to make sure previously offending open files were no longer left open.
> 
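A rough sketch of this kind of check, for anyone repeating the test: sampling the agent's open-descriptor count over time makes a leak show up as a steadily growing number. The `pgrep` pattern 'AmbariAgent' below is an assumption about the agent's process name, and /proc is Linux-only.

```
# Sample the ambari-agent's open file descriptors; a leak shows up as a
# count that keeps growing between heartbeats.
import os
import subprocess


def agent_pid():
    # 'AmbariAgent' as a pgrep -f pattern is an assumption; adjust as needed.
    out = subprocess.check_output(['pgrep', '-f', 'AmbariAgent'])
    return int(out.decode().split()[0])


def open_fd_count(pid):
    # Each entry under /proc/<pid>/fd is one open descriptor (Linux only).
    return len(os.listdir('/proc/%d/fd' % pid))


if __name__ == '__main__':
    pid = agent_pid()
    print('ambari-agent pid %d has %d open files' % (pid, open_fd_count(pid)))
```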
> 
> Thanks,
> 
> Robert Levas
> 
>

