mesos-issues mailing list archives

From "Benno Evers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-8703) Mesos master can`t reconnect to zookeeper
Date Wed, 21 Mar 2018 19:12:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408442#comment-16408442 ]

Benno Evers commented on MESOS-8703:
------------------------------------

The original ZooKeeper crash might well be caused by MESOS-8550.

However, this should usually just result in a crash and a subsequent restart of the master.
Instead, the master seems to lock up during shutdown. The cause might be an issue similar to
MESOS-1477, although I couldn't find any suspicious changes to the related files in version
1.4.1.

If this issue is somewhat reproducible, it would probably be helpful to include stack traces
for all threads when the master becomes unresponsive.
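
One way to collect such traces is to attach gdb to the hung master in batch mode. The
following is only a rough sketch, assuming gdb is installed on the host (or inside the
container) and that the caller is allowed to ptrace the master process. The helper names
gdb_backtrace_command and dump_all_threads are made up for illustration; the gdb options
used (-p, -batch, -ex) and the "thread apply all bt" command are standard gdb.

```python
import subprocess
from typing import List


def gdb_backtrace_command(pid: int) -> List[str]:
    """Build a gdb invocation that prints a backtrace for every thread."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return [
        "gdb",
        "-p", str(pid),                 # attach to the running process
        "-batch",                       # run the -ex commands, then exit
        "-ex", "thread apply all bt",   # backtrace of every thread
    ]


def dump_all_threads(pid: int, timeout_s: float = 60.0) -> str:
    """Run gdb against `pid` and return everything it printed.

    Hypothetical helper: requires gdb on PATH and ptrace permission.
    """
    result = subprocess.run(
        gdb_backtrace_command(pid),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=timeout_s,
        check=False,
    )
    return result.stdout

# Example (not executed here): print(dump_all_threads(<pid of mesos-master>))
```

In the report the master runs as PID 1, which suggests a container, so gdb would probably
have to be run from inside it. Debug symbols for mesos-master would make the frames that
show up as (unknown) in the log considerably more useful.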

 

> Mesos master can`t reconnect to zookeeper 
> ------------------------------------------
>
>                 Key: MESOS-8703
>                 URL: https://issues.apache.org/jira/browse/MESOS-8703
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.1
>            Reporter: Anton Malevich
>            Priority: Blocker
>
> Mesos master can't reconnect to ZooKeeper after ZooKeeper hangs.
> {noformat}
> 2018-03-20 10:16:45,608:1(0x2ae675db6700):ZOO_ERROR@handle_socket_error_msg@1666: Socket [<zknode1>:2181] zk retcode=-7, errno=110(Connection timed out): connection to <zknode1>:2181 timed out (exceeded timeout by 3ms)
> 2018-03-20 10:16:45,609:1(0x2ae675db6700):ZOO_INFO@check_events@1728: initiated connection to server [<zknode2>:2181]
> 2018-03-20 10:16:45,619:1(0x2ae675db6700):ZOO_ERROR@handle_socket_error_msg@1764: Socket [<zknode2>:2181] zk retcode=-112, errno=116(Stale file handle): sessionId=0x5623d0e483dd435 has expired.
> I0320 10:16:45.620604    18 group.cpp:511] ZooKeeper session expired
> I0320 10:16:45.620802    16 detector.cpp:152] Detected a new leader: None
> I0320 10:16:45.620957    16 master.cpp:2176] The newly elected leader is None
> mesos-master: ../../3rdparty/stout/include/stout/option.hpp:112: T& Option<T>::get() & [with T = mesos::MasterInfo]: Assertion `isSome()' failed.
> *** Aborted at 1521541005 (unix time) try "date -d @1521541005" if you are using GNU date ***
> PC: @     0x2ae63d2b9428 (unknown)
> *** SIGABRT (@0x1) received by PID 1 (TID 0x2ae648ffa700) from PID 1; stack trace: ***
>     @     0x2ae63d078390 (unknown)
>     @     0x2ae63d2b9428 (unknown)
>     @     0x2ae63d2bb02a (unknown)
>     @     0x2ae63d2b1bd7 (unknown)
>     @     0x2ae63d2b1c82 (unknown)
> 2018-03-20 10:16:45,622:1(0x2ae649ffc700):ZOO_INFO@zookeeper_close@2543: Freeing zookeeper resources for sessionId=0x5623d0e483dd435
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@730: Client environment:host.name=<mesos_hostname>
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@738: Client environment:os.arch=4.8.15-1.el7.wg.x86_64
> 2018-03-20 10:16:45,623:1(0x2ae6477f7700):ZOO_INFO@log_env@739: Client environment:os.version=#1 SMP Mon Dec 26 14:34:45 UTC 2016
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@755: Client environment:user.home=/root
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@log_env@767: Client environment:user.dir=/
> 2018-03-20 10:16:45,624:1(0x2ae6477f7700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=<zk_pool> sessionTimeout=10000 watcher=0x2ae63b3711e0 sessionId=0 sessionPasswd=<null> context=0x2ae6900036f8 flags=0
>     @     0x2ae63ad6b55b mesos::internal::master::Master::detected()
>     @     0x2ae63b9e4cfc process::ProcessBase::visit()
> 2018-03-20 10:16:45,634:1(0x2ae6765b7700):ZOO_INFO@check_events@1728: initiated connection to server [<zknode1>:2181]
>     @     0x2ae63b9fac84 process::ProcessManager::resume()
>     @     0x2ae63b9fd5e6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>     @     0x2ae63c87ec80 (unknown)
>     @     0x2ae63d06e6ba start_thread
>     @     0x2ae63d38b3dd (unknown)
> 2018-03-20 10:16:45,651:1(0x2ae6765b7700):ZOO_INFO@check_events@1775: session establishment complete on server [<zknode1>:2181], sessionId=0x1623f43348692c7, negotiated timeout=10000
> I0320 10:16:45.651684    15 group.cpp:341] Group process (zookeeper-group(2)@<mesos4>:5050) connected to ZooKeeper
> I0320 10:16:45.651733    15 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> I0320 10:16:45.651743    15 group.cpp:419] Trying to create path '/mesos' in ZooKeeper
> I0320 10:16:45.676736    15 detector.cpp:152] Detected a new leader: (id='704')
> I0320 10:16:45.676844    15 group.cpp:700] Trying to get '/mesos/json.info_0000000704' in ZooKeeper
> I0320 10:16:45.683346    15 zookeeper.cpp:262] A new leading master (UPID=master@<mesos4>:5050) is detected
> {noformat}
> After this, the Mesos master does not respond to HTTP requests, and leader election does not happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
