ambari-dev mailing list archives

From "Alex Piggott (JIRA)" <>
Subject [jira] [Created] (AMBARI-12485) Ambari agent stopped reporting status until some file was deleted
Date Tue, 21 Jul 2015 21:11:05 GMT
Alex Piggott created AMBARI-12485:

             Summary: Ambari agent stopped reporting status until some file was deleted
                 Key: AMBARI-12485
             Project: Ambari
          Issue Type: Bug
          Components: ambari-agent
    Affects Versions: 2.0.0
         Environment: Centos6
            Reporter: Alex Piggott

1) I restarted YARN after making a config change, and observed that one of the 4 nodes
of the cluster (call it db001) was not restarting any of its YARN components.

2) I restarted ambari-agent on db001 from the command line, after which all services were
still shown as down (red).

3) Note that I _was_ then able to restart the YARN components on db001.

4) I found the following error message being generated every minute:

[root@db001 ~]# more /var/lib/ambari-agent/data/status_command_stderr.txt
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/ZOOKEEPER/", line 67, in <module>
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/", line 181, in execute
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/", line 109, in load_structured_out
    Script.structuredOut = json.load(fp)
  File "/usr/lib64/python2.6/json/", line 267, in load
    parse_constant=parse_constant, **kw)
  File "/usr/lib64/python2.6/json/", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.6/json/", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.6/json/", line 338, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

-rw-r--r-- 1 root root     0 Jul 21 16:50 status_command_stdout.txt
-rw------- 1 root root 18310 Jul 21 16:50 status_command.json
-rw-r--r-- 1 root root  1008 Jul 21 16:50 status_command_stderr.txt

I stuck some print statements in the Python (!!) and found that the file failing to parse was
an empty file that had not been modified since Jul 19 (today is Jul 21):

[root@db001 data]# ls -l /var/lib/ambari-agent/data/structured-out-status.json
-rw-rw-rw- 1 root root 0 Jul 19 01:22 /var/lib/ambari-agent/data/structured-out-status.json
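The failure above can be reproduced outside Ambari. The sketch below is written for Python 3, where the error is `json.JSONDecodeError` (a subclass of `ValueError`) and its message differs from Python 2.6's "No JSON object could be decoded". It shows that `json.load` on a zero-byte file raises `ValueError`. `load_structured_out_tolerant` is a hypothetical helper, not Ambari's `load_structured_out` and not its actual fix; it simply treats a missing or empty file as an empty result.

```python
import json
import os
import tempfile


def load_structured_out_tolerant(path):
    """Hypothetical helper (not Ambari's code): load a structured-output
    JSON file, treating a missing, empty, or whitespace-only file as {}."""
    try:
        with open(path) as fp:
            content = fp.read()
    except IOError:  # missing/unreadable file (IOError aliases OSError on Python 3)
        return {}
    if not content.strip():
        return {}
    return json.loads(content)


# Reproduce the report: a zero-byte file makes json.load() raise ValueError.
fd, empty_path = tempfile.mkstemp(suffix=".json")
os.close(fd)  # leave the file at 0 bytes

try:
    with open(empty_path) as fp:
        json.load(fp)
    raised = False
except ValueError:  # json.JSONDecodeError subclasses ValueError on Python 3
    raised = True

print("json.load raised ValueError:", raised)
print("tolerant loader returned:", load_structured_out_tolerant(empty_path))
os.remove(empty_path)
```

Treating an empty file as `{}` keeps the status command running, but it can also hide a write that genuinely failed; logging the condition, or removing the stale file as was done by hand here, may be preferable.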

After I deleted that file, the error messages went away and Ambari showed all components as green.
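The manual workaround can be sketched as a check for zero-byte structured-output files in the agent's data directory. This is a hypothetical diagnostic, not an Ambari tool: the directory comes from this report, while the `structured-out*.json` glob pattern is an assumption generalized from the one file named here. It only lists candidates and leaves deletion to the operator.

```python
import glob
import os


def find_empty_structured_out(data_dir="/var/lib/ambari-agent/data",
                              pattern="structured-out*.json"):
    """Return sorted paths of zero-byte files in data_dir matching pattern.

    Hypothetical diagnostic; the glob pattern is an assumption based on
    the single file named in this report (structured-out-status.json).
    """
    empty = []
    for path in sorted(glob.glob(os.path.join(data_dir, pattern))):
        if os.path.isfile(path) and os.path.getsize(path) == 0:
            empty.append(path)
    return empty


if __name__ == "__main__":
    # List candidates only; deleting them is left to the operator.
    for path in find_empty_structured_out():
        print("empty structured output file:", path)
```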

Note that nobody had touched the cluster since July 14.

Hope this report is of some use!

This message was sent by Atlassian JIRA
