ambari-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AMBARI-12522) Provide traceback patch to debug hanging agents
Date Fri, 24 Jul 2015 18:18:06 GMT

    [ https://issues.apache.org/jira/browse/AMBARI-12522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640862#comment-14640862 ]

Hudson commented on AMBARI-12522:
---------------------------------

FAILURE: Integrated in Ambari-trunk-Commit #3167 (See [https://builds.apache.org/job/Ambari-trunk-Commit/3167/])
AMBARI-12522. Provide traceback patch to debug hanging agents (dlysnichenko) (dlysnichenko: http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=7c12637cbb1aaec53a600c420b8d731044bc6ba5)
* ambari-agent/src/test/python/ambari_agent/TestMain.py
* ambari-agent/src/main/python/ambari_agent/HeartbeatHandlers.py
* ambari-agent/src/main/python/ambari_agent/main.py


> Provide traceback patch to debug hanging agents
> -----------------------------------------------
>
>                 Key: AMBARI-12522
>                 URL: https://issues.apache.org/jira/browse/AMBARI-12522
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 2.2.0
>
>         Attachments: AMBARI-12522.patch
>
>
> There have been a few reports (on trunk and on local VMs) that at times the agent becomes extremely busy and stops processing commands. The only way out is an agent restart.
> Some time ago we had a signal handler that would dump a traceback to the log and open a debugger, or something similar, but it appears to have been removed. We decided to reimplement this signal handler.
> The patch tries to load and register a traceback handler if one is available, and skips this step if not. It also fixes signal handlers being bound twice during agent start.
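The "register if available, otherwise skip" behaviour, together with a guard against binding handlers twice, can be sketched in Python roughly as follows. This is not the code from AMBARI-12522.patch: the names `register_faulthandler`, `bind_signal_handlers` and `_handlers_bound` are hypothetical. The sketch assumes a Unix host (SIGUSR1 does not exist on Windows) and a Python 3 interpreter, where `faulthandler` is part of the standard library. The only library API it relies on is `faulthandler.register(signum, file=..., all_threads=...)`.

```python
import signal
import sys

# Hypothetical module-level flag: prevents binding the handlers twice if
# startup code happens to call bind_signal_handlers() more than once.
_handlers_bound = False


def register_faulthandler(dump_file=None, sig=signal.SIGUSR1):
    """Register a traceback dump on `sig` if faulthandler can be imported.

    faulthandler is in the standard library since Python 3.3; on Python 2.x
    it exists only if the PyPI backport was installed. A missing module is
    not treated as an error: registration is simply skipped.
    """
    try:
        import faulthandler
    except ImportError:
        return False
    if dump_file is None:
        dump_file = sys.stderr
    # all_threads=True dumps the stack of every thread, not only the one
    # that happens to receive the signal.
    faulthandler.register(sig, file=dump_file, all_threads=True)
    print("Registered faulthandler")
    return True


def bind_signal_handlers(dump_file=None):
    """Bind this process's signal handlers at most once; True if bound now."""
    global _handlers_bound
    if _handlers_bound:
        return False
    # Any other agent-specific handlers would also be bound here.
    register_faulthandler(dump_file)
    _handlers_bound = True
    return True
```

A module-level flag is the simplest way to make the binding step idempotent. Because faulthandler writes straight to the file descriptor from a C-level handler, the dump still appears even when the Python main thread is stuck.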
> To install faulthandler on CentOS 6 (*faulthandler is not included in the default distribution of Python 2.x*), run:
> {code}
> yum install python-devel gcc -y
> # install setuptools
> curl https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py | python -
> # install pip
> curl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python -
> easy_install faulthandler
> {code}
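The bootstrap scripts above install into whichever `python` ran them. Before restarting the agent, it may therefore be worth confirming that the interpreter the agent actually uses can import the module. The following is a minimal sketch that assumes it is run with the same interpreter binary as the agent. `module_available` is a hypothetical helper. It uses `__import__` rather than `importlib` so that it also works on Python 2.6, which has no `importlib`.

```python
def module_available(name="faulthandler"):
    """Return True if `name` can be imported by the running interpreter."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("faulthandler available: %s" % module_available())
```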
> If the faulthandler module is available, the agent writes {{Registered faulthandler}} to the agent's .out file.
> After that, we can start the agent and dump tracebacks for all running threads like this:
> {code}
> # kill -USR1 `cat /var/run/ambari-agent/ambari-agent.pid`
> # cat /var/log/ambari-agent/ambari-agent.out
> Registered faulthandler
> Thread 0x00007feccffff700 (most recent call first):
>   File "/usr/lib64/python2.6/socket.py", line 197 in accept
>   File "/usr/lib/python2.6/site-packages/ambari_agent/PingPortListener.py", line 67 in run
>   File "/usr/lib64/python2.6/threading.py", line 532 in __bootstrap_inner
>   File "/usr/lib64/python2.6/threading.py", line 504 in __bootstrap
> Thread 0x00007fecd4a89700 (most recent call first):
>   File "/usr/lib/python2.6/site-packages/ambari_agent/DataCleaner.py", line 123 in run
>   File "/usr/lib64/python2.6/threading.py", line 532 in __bootstrap_inner
>   File "/usr/lib64/python2.6/threading.py", line 504 in __bootstrap
> Current thread 0x00007fecdfe8c700 (most recent call first):
>   File "/usr/lib64/python2.6/threading.py", line 258 in wait
>   File "/usr/lib64/python2.6/threading.py", line 395 in wait
>   File "/usr/lib/python2.6/site-packages/ambari_agent/HeartbeatHandlers.py", line 122 in wait
>   File "/usr/lib/python2.6/site-packages/ambari_agent/NetUtil.py", line 108 in try_to_connect
>   File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 291 in main
>   File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 306 in <module>
> {code}
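The `kill -USR1` step can also be scripted, for example from a monitoring tool. The following is a minimal Python sketch that assumes the PID file path shown above. `request_thread_dump` is a hypothetical helper and is not part of Ambari. SIGUSR1's default action is to terminate the process, so this signal should only be sent to an agent that has logged `Registered faulthandler`.

```python
import os
import signal


def request_thread_dump(pid_file="/var/run/ambari-agent/ambari-agent.pid"):
    """Send SIGUSR1 to the process whose PID is stored in `pid_file`.

    Roughly equivalent to: kill -USR1 `cat <pid_file>`
    Returns the PID that was signalled.
    """
    with open(pid_file) as f:
        pid = int(f.read().strip())
    # The target must have a SIGUSR1 handler registered (e.g. faulthandler);
    # otherwise the default action terminates it.
    os.kill(pid, signal.SIGUSR1)
    return pid
```

The traceback is written by the signalled agent process, not by the caller, so it shows up in the agent's .out file as in the output above.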



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
