ambari-dev mailing list archives

From "Andrew Onischuk" <aonis...@hortonworks.com>
Subject Re: Review Request 31946: Ambari-agent died under SLES (and could not even restart automatically)
Date Wed, 11 Mar 2015 18:00:05 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31946/
-----------------------------------------------------------

(Updated March 11, 2015, 6 p.m.)


Review request for Ambari and Dmitro Lisnichenko.


Bugs: AMBARI-10031
    https://issues.apache.org/jira/browse/AMBARI-10031


Repository: ambari


Description (updated)
-------

I was performing a rolling upgrade (RU) over the weekend and left the cluster
running to finalize it later. The cluster ran unattended for 2 days, and
ambari-agent died with an out-of-memory error. Agents on the other nodes are
running fine.  
The node has 8 GB of RAM, and memory does not look exhausted (unless the agent
needs more than 1100 MB of RAM):

    
    
    
    dmitriusan-sles3-ru1-6:~ # free -m
                 total       used       free     shared    buffers     cached
    Mem:          7872       7077        795          0        134        222
    -/+ buffers/cache:       6720       1151
    Swap:            0          0          0
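
For reference, the `-/+ buffers/cache` row can be re-derived from the `Mem:` row. The sketch below assumes the usual convention of this older `free` output (application usage is `used` minus buffers and cache, and reclaimable memory is `free` plus buffers and cache). The function name is illustrative only.

```python
def buffers_cache_row(used, free, buffers, cached):
    """Return (used_by_apps, available) in the same unit as the inputs.

    Assumption: the "-/+ buffers/cache" row is derived this way, as in
    older procps versions of `free`.
    """
    used_by_apps = used - buffers - cached
    available = free + buffers + cached
    return used_by_apps, available


# Values copied from the `free -m` output above (MiB).
used_by_apps, available = buffers_cache_row(used=7077, free=795,
                                            buffers=134, cached=222)
print(used_by_apps, available)
```

The result agrees with the reported row to within 1 MiB of rounding: about 1.1 GB was still reclaimable when the snapshot was taken.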
    

So I suspect a memory leak (probably due to status checks/jobs). Log files are
attached.

    
    
    
    WARNING 2015-03-10 06:10:30,692 scheduler.py:496 - Run time of job "c811d199-b07f-4eaf-995b-bf91e5ff848f (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.480393)" was missed by 0:00:03.212293
    WARNING 2015-03-10 06:10:38,214 scheduler.py:496 - Run time of job "5c219f4e-62e1-482c-88fc-e11b40935541 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:29.881993)" was missed by 0:00:08.332634
    INFO 2015-03-10 06:10:38,995 scheduler.py:527 - Job "13163515-f895-4342-b802-12ce39c65fb9 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.473685)" executed successfully
    INFO 2015-03-10 06:10:39,088 scheduler.py:527 - Job "6186b998-9eb6-4f7b-af8b-96c27c0da962 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.472139)" executed successfully
    INFO 2015-03-10 06:10:39,089 scheduler.py:527 - Job "1531e319-25e9-4909-b461-bec0ba59c1d9 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.472907)" executed successfully
    INFO 2015-03-10 06:10:39,123 Controller.py:247 - Heartbeat response received (id = 21240)
    INFO 2015-03-10 06:10:39,408 Controller.py:291 - No commands sent from dmitriusan-sles3-ru1-5.cs1cloud.internal
    INFO 2015-03-10 06:10:42,672 scheduler.py:527 - Job "81137f2d-a1a8-433f-9446-4167a06b6fa3 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.473320)" executed successfully
    WARNING 2015-03-10 06:10:43,575 scheduler.py:496 - Run time of job "84ac5821-646b-41c1-8ac7-a561cd75d3ef (trigger: interval[0:01:00], next run at: 2015-03-10 06:10:41.837046)" was missed by 0:00:01.737801
    ERROR 2015-03-10 06:10:45,043 CustomServiceOrchestrator.py:201 - Caught an exception while executing custom service command: <type 'exceptions.OSError'>: [Errno 12] Cannot allocate memory; [Errno 12] Cannot allocate memory
    Traceback (most recent call last):
      File "/usr/lib/python2.6/site-packages/ambari_agent/CustomServiceOrchestrator.py", line 176, in runCommand
        task_id, override_output_files, handle = handle)
      File "/usr/lib/python2.6/site-packages/ambari_agent/PythonExecutor.py", line 84, in run_file
        process = self.launch_python_subprocess(pythonCommand, tmpout, tmperr)
      File "/usr/lib/python2.6/site-packages/ambari_agent/PythonExecutor.py", line 151, in launch_python_subprocess
        stderr=tmperr, close_fds=close_fds, env=command_env)
      File "/usr/lib64/python2.6/subprocess.py", line 623, in __init__
        errread, errwrite)
      File "/usr/lib64/python2.6/subprocess.py", line 1051, in _execute_child
        self.pid = os.fork()
    OSError: [Errno 12] Cannot allocate memory
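
The traceback ends in `os.fork()` failing with `ENOMEM`, raised from inside `subprocess.Popen`. As a rough sketch (not the change under review), a launcher could recognize this case and report the parent's own memory footprint, which would help confirm or rule out a leak in the agent. `launch_or_report` and its `popen` parameter are hypothetical names used only for illustration.

```python
import errno
import resource
import subprocess


def launch_or_report(cmd, popen=subprocess.Popen):
    """Start cmd; on ENOMEM return (None, diagnostic) instead of raising.

    Hypothetical helper for illustration; not part of ambari-agent.
    """
    try:
        return popen(cmd), None
    except OSError as err:
        if err.errno != errno.ENOMEM:
            raise  # other failures (e.g. missing binary) are not handled here
        # ru_maxrss is the peak resident set size of this process
        # (reported in kilobytes on Linux).
        peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        return None, "fork failed with ENOMEM; parent peak RSS = %d" % peak_rss
```

The `popen` parameter exists only so the failure path can be exercised without actually exhausting memory.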
    

Also, the agent could not restart automatically:

    
    
    
    INFO 2015-03-10 06:11:44,312 NetUtil.py:60 - Connecting to https://dmitriusan-sles3-ru1-5.cs1cloud.internal:8440/connection_info
    INFO 2015-03-10 06:11:44,639 security.py:93 - SSL Connect being called.. connecting to the server
    INFO 2015-03-10 06:11:44,730 security.py:55 - SSL connection established. Two-way SSL authentication is turned off on the server.
    INFO 2015-03-10 06:11:44,733 Controller.py:247 - Heartbeat response received (id = 21240)
    ERROR 2015-03-10 06:11:44,733 Controller.py:261 - Error in responseId sequence - restarting
    INFO 2015-03-10 06:11:46,986 main.py:68 - loglevel=logging.INFO
    INFO 2015-03-10 06:11:46,988 DataCleaner.py:36 - Data cleanup thread started
    INFO 2015-03-10 06:11:46,997 DataCleaner.py:117 - Data cleanup started
    INFO 2015-03-10 06:11:47,222 DataCleaner.py:119 - Data cleanup finished
    ERROR 2015-03-10 06:11:47,641 main.py:243 - Failed to start ping port listener of: Could not open port 8670 because port already used by another process:
    UID        PID  PPID  C STIME TTY          TIME CMD
    root      1421     1  0 06:07 ?        00:00:00 /usr/bin/sudo su ambari-qa -l -s /bin/bash -c export  PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/lib/hive/bin/:/usr/sbin/' ; ! beeline -u 'jdbc:hive2://dmitriusan-sles3-ru1-6.cs1cloud.internal:10000' -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'
    
    INFO 2015-03-10 06:11:47,654 PingPortListener.py:62 - Ping port listener killed
    

A manual restart failed as well:

    
    
    
    ERROR: ambari-agent start failed. For more details, see /var/log/ambari-agent/ambari-agent.out:
    ====================
    Failed to start ping port listener of: Could not open port 8670 because port already used by another process:
    UID        PID  PPID  C STIME TTY          TIME CMD
    root     25597     1  0 05:59 ?        00:00:00 /usr/bin/sudo su ambari-qa -l -s /bin/bash -c export  PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/lib/hive/bin/:/usr/sbin/' ; ! beeline -u 'jdbc:hive2://dmitriusan-sles3-ru1-6.cs1cloud.internal:10000' -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'
    ====================
    Agent out at: /var/log/ambari-agent/ambari-agent.out
    Agent log at: /var/log/ambari-agent/ambari-agent.log
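
In both restart attempts the process reported as holding port 8670 is a leftover `sudo su ambari-qa ...` child, not an agent. One plausible explanation, which the logs above do not confirm, is that the child inherited the agent's listening socket when it was spawned. A minimal sketch of how a listener can be marked close-on-exec, and how a port can be probed before starting, follows. Both helper names and the `127.0.0.1` default are assumptions made for illustration.

```python
import fcntl
import socket


def open_listener(port, host="127.0.0.1"):
    """Open a listening TCP socket that is closed automatically on exec().

    Hypothetical helper; with FD_CLOEXEC set, spawned commands do not keep
    the port open after the parent exits.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        flags = fcntl.fcntl(sock.fileno(), fcntl.F_GETFD)
        fcntl.fcntl(sock.fileno(), fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        sock.bind((host, port))
        sock.listen(1)
    except Exception:
        sock.close()
        raise
    return sock


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
        return True
    except socket.error:
        return False
    finally:
        probe.close()
```

For example, `port_is_free(8670)` would return `False` for as long as an orphaned child still holds the listener.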


Diffs
-----

  ambari-agent/src/test/python/resource_management/TestGroupResource.py 597a6ee 
  ambari-agent/src/test/python/resource_management/TestUserResource.py c946fed 
  ambari-common/src/main/python/resource_management/core/shell.py b1ab0bc 

Diff: https://reviews.apache.org/r/31946/diff/


Testing
-------

mvn clean test


Thanks,

Andrew Onischuk

