mesos-issues mailing list archives

From "Sameer Shah (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-7174) Mesos Agent crashes when it is unable to reconnect to zookeeper
Date Sat, 25 Feb 2017 00:49:44 GMT

     [ https://issues.apache.org/jira/browse/MESOS-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Shah updated MESOS-7174:
-------------------------------
    Description: 
Mesos agent crashed when it was unable to reconnect to ZooKeeper. The relevant logs are below.
I have replaced hostnames and IP addresses in the logs with placeholders (HOSTNAME_1, IP1, etc.).

{quote}
mesos-agent[23576]: 2017-02-23 05:09:36,718:23576(0x7f932ad69700):ZOO_ERROR@handle_socket_error_msg@1666: Socket [IP1:2181] zk retcode=-7, errno=110(Connection timed out): connection to IP1:2181 timed out (exceeded timeout by 4ms)
mesos-agent[23576]: I0223 05:09:36.719043 23592 group.cpp:460] Lost connection to ZooKeeper, attempting to reconnect ...
mesos-agent[23576]: 2017-02-23 05:09:43,386:23576(0x7f932ad69700):ZOO_ERROR@handle_socket_error_msg@1666: Socket [IP2:2181] zk retcode=-7, errno=110(Connection timed out): connection to IP2:2181 timed out (exceeded timeout by 2ms)
mesos-agent[23576]: W0223 05:09:46.721179 23588 group.cpp:503] Timed out waiting to connect to ZooKeeper. Forcing ZooKeeper session (sessionId=300007df99d24f2) expiration
mesos-agent[23576]: I0223 05:09:46.722100 23588 group.cpp:519] ZooKeeper session expired
mesos-agent[23576]: I0223 05:09:46.722249 23609 detector.cpp:152] Detected a new leader: None
mesos-agent[23576]: 2017-02-23 05:09:46,722:23576(0x7f932e989700):ZOO_INFO@zookeeper_close@2543: Freeing zookeeper resources for sessionId=0x300007df99d24f2
mesos-agent[23576]: I0223 05:09:46.722776 23589 status_update_manager.cpp:174] Pausing sending status updates
mesos-agent[23576]: I0223 05:09:46.722923 23607 slave.cpp:888] Lost leading master
mesos-agent[23576]: I0223 05:09:46.722960 23607 slave.cpp:927] Detecting new master
mesos-agent[23576]: 2017-02-23 05:09:46,731:23576(0x7f933198f700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
mesos-agent[23576]: 2017-02-23 05:09:46,731:23576(0x7f933198f700):ZOO_INFO@log_env@730: Client environment:host.name=HOSTNAME
mesos-agent[23576]: 2017-02-23 05:09:46,731:23576(0x7f933198f700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
mesos-agent[23576]: 2017-02-23 05:09:46,731:23576(0x7f933198f700):ZOO_INFO@log_env@738: Client environment:os.arch=####
mesos-agent[23576]: 2017-02-23 05:09:46,731:23576(0x7f933198f700):ZOO_INFO@log_env@739: Client environment:os.version=####
mesos-agent[23576]: 2017-02-23 05:09:46,731:23576(0x7f933198f700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
mesos-agent[23576]: 2017-02-23 05:09:46,731:23576(0x7f933198f700):ZOO_INFO@log_env@755: Client environment:user.home=/root
mesos-agent[23576]: 2017-02-23 05:09:46,731:23576(0x7f933198f700):ZOO_INFO@log_env@767: Client environment:user.dir=/
mesos-agent[23576]: 2017-02-23 05:09:46,731:23576(0x7f933198f700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=HOSTNAME_1:2181,HOSTNAME_2:2181,HOSTNAME_3:2181,HOSTNAME_4:2181,HOSTNAME_5:2181 sessionTimeout=10000 watcher=0x7f933f98c300 sessionId=0 sessionPasswd=<null> context=0x7f92d00406a0 flags=0
mesos-agent[23576]: W0223 05:10:20.030608 23608 slave.cpp:1480] Ignoring run task message from master@IP:5050 because it is not the expected master: None
mesos-agent[23576]: 2017-02-23 05:10:26,906:23576(0x7f933198f700):ZOO_ERROR@getaddrs@613: getaddrinfo: No such file or directory
mesos-agent[23576]: F0223 05:10:26.906946 23598 zookeeper.cpp:132] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
mesos-agent[23576]: *** Check failure stack trace: ***
mesos-agent[23576]: @     0x7f933fefc34d  google::LogMessage::Fail()
mesos-agent[23576]: @     0x7f933fefe08c  google::LogMessage::SendToLog()
mesos-agent[23576]: @     0x7f933fefbf3c  google::LogMessage::Flush()
mesos-agent[23576]: @     0x7f933fefc149  google::LogMessage::~LogMessage()
mesos-agent[23576]: @     0x7f933fefd0b2  google::ErrnoLogMessage::~ErrnoLogMessage()
mesos-agent[23576]: @     0x7f933f98cb88  ZooKeeperProcess::initialize()
mesos-agent[23576]: @     0x7f933fe8cca1  process::ProcessManager::resume()
mesos-agent[23576]: @     0x7f933fe8cf57  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
mesos-agent[23576]: @     0x7f933e3341e0  (unknown)
mesos-agent[23576]: @     0x7f933e58ddc5  start_thread
mesos-agent[23576]: @     0x7f933dd9e28d  __clone
systemd[1]: mesos-agent.service: main process exited, code=killed, status=6/ABRT
systemd[1]: Unit mesos-agent.service entered failed state.
systemd[1]: mesos-agent.service failed.
systemd[1]: mesos-agent.service holdoff time over, scheduling restart.
systemd[1]: Started Mesos Agent.
systemd[1]: Starting Mesos Agent...
{quote}
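The two lines just before the abort suggest that creating the new ZooKeeper handle failed during host name resolution, not during a socket connect: the getaddrs error appears on the same client thread (0x7f933198f700) about 40 seconds after the 05:09:46 zookeeper_init line. The minimal Python model below shows how such a failure could print as "No such file or directory [2]". It assumes that the ZooKeeper C client 3.4.x converts getaddrinfo() return codes to errno roughly as shown (this is my reading of its getaddrinfo_errno() helper and has not been checked against 3.4.8 specifically), and that glog's PLOG appends ": <strerror> [<errno>]" to the message. The function names getaddrinfo_to_errno and plog_style are hypothetical; Python is used only as a compact stand-in for the C and C++ code paths.

```python
import errno
import os
import socket


def getaddrinfo_to_errno(rc: int) -> int:
    """Hypothetical model of the 3.4.x client's getaddrinfo_errno() helper."""
    # EAI_NODATA is not defined on every platform, so look it up defensively.
    no_data = getattr(socket, "EAI_NODATA", None)
    if rc == socket.EAI_NONAME or (no_data is not None and rc == no_data):
        return errno.ENOENT  # strerror: "No such file or directory"
    if rc == socket.EAI_MEMORY:
        return errno.ENOMEM
    # Every other code, including EAI_AGAIN (temporary failure), becomes EINVAL.
    return errno.EINVAL


def plog_style(prefix: str, err: int) -> str:
    """Mimic the suffix glog's PLOG adds: ': <strerror(err)> [<err>]'."""
    return f"{prefix}: {os.strerror(err)} [{err}]"


err = getaddrinfo_to_errno(socket.EAI_NONAME)
print(plog_style("Failed to create ZooKeeper, zookeeper_init", err))
```

On Linux this prints the same text as the F0223 line above. If the model is right, the ENOENT in the log means getaddrinfo() returned EAI_NONAME or EAI_NODATA for at least one of the configured ZooKeeper host names.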

  was:
The same description and log excerpt, with the log lines in reverse chronological order (newest first).


> Mesos Agent crashes when it is unable to reconnect to zookeeper
> ---------------------------------------------------------------
>
>                 Key: MESOS-7174
>                 URL: https://issues.apache.org/jira/browse/MESOS-7174
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Sameer Shah
>
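The fatal line comes from zookeeper.cpp:132 inside ZooKeeperProcess::initialize(), and the ErrnoLogMessage frame in the stack trace indicates a PLOG(FATAL). My understanding, which should be confirmed in src/zookeeper/zookeeper.cpp for the affected Mesos release, is that this initializer retries zookeeper_init() only while errno is EINVAL and aborts on any other failure, so an ENOENT from a failed lookup is fatal. The Python sketch below models an alternative policy that also treats ENOENT as transient, bounded by a deadline. init_with_retry, RETRYABLE_ERRNOS, the (handle, errno) contract of init, and the 600-second default are all hypothetical choices for illustration, not Mesos APIs.

```python
import errno
import time
from typing import Callable, Optional, Tuple

# Hypothetical policy: errno values treated as transient resolution failures.
RETRYABLE_ERRNOS = frozenset({errno.EINVAL, errno.ENOENT})


def init_with_retry(
    init: Callable[[], Tuple[Optional[object], int]],
    deadline_s: float = 600.0,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> object:
    """Retry `init` until it yields a handle, fails with a non-retryable
    errno, or `deadline_s` seconds have elapsed.

    `init` stands in for zookeeper_init(): it returns (handle, 0) on success
    and (None, errno_value) on failure.
    """
    start = clock()
    while True:
        handle, err = init()
        if handle is not None:
            return handle
        if err not in RETRYABLE_ERRNOS or clock() - start >= deadline_s:
            # Give up: surface the failure to the caller instead of aborting.
            raise OSError(err, "zookeeper_init failed")
        sleep(delay_s)


# Usage: a fake init that fails twice with ENOENT, then succeeds.
attempts = []


def flaky_init():
    attempts.append(1)
    return (None, errno.ENOENT) if len(attempts) < 3 else ("zk-handle", 0)


print(init_with_retry(flaky_init, sleep=lambda s: None))  # zk-handle
```

Retrying on ENOENT is a trade-off: the same errno would also result from a misspelled ZooKeeper host name in the agent's configuration, which is why the sketch bounds the retries with a deadline and then raises instead of looping forever.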



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
