mesos-issues mailing list archives

From "Neil Conway (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
Date Wed, 21 Oct 2015 22:34:27 GMT

    [ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968099#comment-14968099 ]

Neil Conway commented on MESOS-2186:
------------------------------------

Ah, okay. So the situation seems to be:

(1) zookeeper_init() returns NULL when getaddrinfo() fails, as intended.
(2) Mesos is _designed_ to loop and retry zookeeper_init(), but it doesn't do so here: we use
a gross hack to determine whether the zookeeper_init() failure was due to a hostname resolution
failure, and apparently it doesn't account for this case (we expect errno == EINVAL, but
apparently we see ENOENT instead).
(3) Hence, we abort the process.
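
As a rough illustration of the condition in (2), here is a minimal C++ sketch of a retry loop that treats both EINVAL and ENOENT as a hostname resolution failure. It is not the actual Mesos code: isResolutionFailure(), initWithRetry(), and the `init` callback (a stand-in for zookeeper_init()) are hypothetical names, and accepting ENOENT alongside EINVAL is an assumption based only on the log in this ticket.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Hypothetical helper: does this errno value look like a hostname
// resolution failure? The ticket reports that EINVAL was expected but
// ENOENT was observed; accepting both is an assumption of this sketch.
bool isResolutionFailure(int error)
{
  return error == EINVAL || error == ENOENT;
}

// Hypothetical retry wrapper. `init` stands in for zookeeper_init(): it
// returns a handle, or nullptr with errno set on failure. Returns the
// handle on success, or nullptr once a non-retryable error is seen or
// maxAttempts is exhausted.
void* initWithRetry(const std::function<void*()>& init, int maxAttempts)
{
  assert(maxAttempts > 0);

  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    errno = 0;
    void* handle = init();
    if (handle != nullptr) {
      return handle;
    }

    if (!isResolutionFailure(errno)) {
      // Not a resolution failure: give up and let the caller decide
      // whether to abort.
      return nullptr;
    }

    // A real loop would sleep here before the next attempt.
  }

  return nullptr;
}
```

With a check like this, the ENOENT case from the log would be retried instead of falling through to the abort in (3).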

We can revise the condition we're checking in #2 slightly, but that is only intended as a
convenience anyway: as discussed above, you should be running Mesos under process supervision
and restarting it when it fails. (The question is just whether we do the retry loop in Mesos
itself or in the process supervisor.) If Mesos exiting unexpectedly "compromises the 'high
availability' of Mesos", your Mesos installation is not configured correctly.

> Mesos crashes if any configured zookeeper does not resolve.
> -----------------------------------------------------------
>
>                 Key: MESOS-2186
>                 URL: https://issues.apache.org/jira/browse/MESOS-2186
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.21.0, 0.26.0
>         Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>            Reporter: Daniel Hall
>            Priority: Critical
>              Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not resolve in DNS Mesos will crash and refuse to start. We noticed this issue while we were rebuilding one of our zookeeper hosts in Google compute (which bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 28640 master.cpp:371] Master allowing unauthenticated slaves to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.159488 28643 contender.cpp:131] Joining the ZK group
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.160753 28640 master.cpp:1202] Successfully attached file '/var/log/mesos/mesos-master.INFO'
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f55fa21f  process::schedule()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @       0x3e498079d1  (unknown)
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @       0x3e494e89dd  (unknown)
> Dec  9 22:54:54 mesosmaster-2 abrt[28650]: Not saving repeating crash in '/usr/local/sbin/mesos-master'
> Dec  9 22:54:54 mesosmaster-2 init: mesos-master main process (28627) killed by ABRT signal
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
