mesos-issues mailing list archives

From "Benjamin Mahler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-2014) error of Recovery failed: Failed to recover registrar: Failed to perform fetch within 5mins
Date Mon, 03 Nov 2014 18:55:34 GMT

    [ https://issues.apache.org/jira/browse/MESOS-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194888#comment-14194888 ]

Benjamin Mahler commented on MESOS-2014:
----------------------------------------

Hi [~jesson], you need to keep a quorum of masters online for a master to successfully recover.
Typically this means running the master under something (like Monit) that ensures that a downed
master process will be restarted promptly, on the order of seconds. Are you doing that?
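
For reference: with a 3-node master cluster the replicated-log quorum is 2, so at most one master may be offline at any given time, and a crashed master has to be brought back before a second one fails. Below is a minimal sketch of what supervising the master with Monit could look like; the pidfile and init-script paths are illustrative assumptions for a CentOS 6 layout, not values taken from this issue:

    # Hypothetical /etc/monit.d/mesos-master: restart a downed master within Monit's poll cycle.
    check process mesos-master with pidfile /var/run/mesos/mesos-master.pid
      start program = "/etc/init.d/mesos-master start"
      stop program  = "/etc/init.d/mesos-master stop"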

> error of Recovery failed: Failed to recover registrar: Failed to perform fetch within 5mins
> -------------------------------------------------------------------------------------------
>
>                 Key: MESOS-2014
>                 URL: https://issues.apache.org/jira/browse/MESOS-2014
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.20.1
>         Environment: CentOS 6.3
>  3.10.5-12.1.x86_64 #1 SMP Fri Aug 16 01:42:38 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Ji Huang
>
> I set up a Mesos master cluster with 3 nodes. At first everything works well, but when the leading master dies, the other candidate nodes cannot recover and elect a new leader; all of the candidate nodes die too.
> I1030 15:01:32.005691  6741 detector.cpp:138] Detected a new leader: (id='16')
> I1030 15:01:32.005692  6737 network.hpp:423] ZooKeeper group memberships changed
> I1030 15:01:32.006089  6741 group.cpp:658] Trying to get '/mesos/info_0000000016' in ZooKeeper
> I1030 15:01:32.006222  6738 group.cpp:658] Trying to get '/mesos/log_replicas/0000000015' in ZooKeeper
> I1030 15:01:32.007230  6738 group.cpp:658] Trying to get '/mesos/log_replicas/0000000016' in ZooKeeper
> I1030 15:01:32.007268  6736 detector.cpp:426] A new leading master (UPID=master@10.99.169.5:5050) is detected
> I1030 15:01:32.007546  6742 master.cpp:1196] The newly elected leader is master@10.99.169.5:5050 with id 20141030-150042-94987018-5050-6735
> I1030 15:01:32.007640  6742 master.cpp:1209] Elected as the leading master!
> I1030 15:01:32.007730  6742 master.cpp:1027] Recovering from registrar
> I1030 15:01:32.007895  6736 registrar.cpp:313] Recovering registrar
> I1030 15:01:32.008388  6742 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.99.169.5:5050, log-replica(1)@10.99.169.6:5050 }
> I1030 15:01:32.051316  6742 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:32.889194  6738 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:33.469511  6743 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:34.324684  6740 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:35.263629  6736 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:36.212492  6739 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:37.015682  6742 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:37.781746  6743 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:38.494547  6737 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:39.186830  6740 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:40.072258  6736 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:40.855337  6743 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:41.516916  6739 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:41.556437  6744 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
> I1030 15:01:41.557253  6741 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:41.557502  6739 recover.cpp:188] Received a recover response from a replica in EMPTY status
> I1030 15:01:41.558156  6741 recover.cpp:188] Received a recover response from a replica in EMPTY status
> I1030 15:01:42.153370  6737 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:42.505698  6742 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> I1030 15:01:42.506060  6738 recover.cpp:188] Received a recover response from a replica in EMPTY status
> I1030 15:01:42.507046  6742 recover.cpp:188] Received a recover response from a replica in EMPTY status
> ......
> F1030 15:06:32.009464  6741 master.cpp:1016] Recovery failed: Failed to recover registrar: Failed to perform fetch within 5mins
> Core dump info:
> #0  0x0000003d636328a5 in raise () from /lib64/libc.so.6
> #1  0x0000003d63634085 in abort () from /lib64/libc.so.6
> #2  0x00007f7a452f0e19 in google::DumpStackTraceAndExit () at src/utilities.cc:147
> #3  0x00007f7a452e7d5d in google::LogMessage::Fail () at src/logging.cc:1458
> #4  0x00007f7a452ebd77 in google::LogMessage::SendToLog (this=0x7f7a41d8f9d0) at src/logging.cc:1412
> #5  0x00007f7a452e9bf9 in google::LogMessage::Flush (this=0x7f7a41d8f9d0) at src/logging.cc:1281
> #6  0x00007f7a452e9efd in google::LogMessageFatal::~LogMessageFatal (this=0x7f7a41d8f9d0, __in_chrg=<value optimized out>) at src/logging.cc:1984
> #7  0x00007f7a44d6759c in mesos::internal::master::fail (message="Recovery failed", failure="Failed to recover registrar: Failed to perform fetch within 5mins") at ../../src/master/master.cpp:1016
> #8  0x00007f7a44da75a6 in __call<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, 0, 1> (__functor=<value optimized out>, __args#0="Failed to recover registrar: Failed to perform fetch within 5mins") at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1137
> #9  operator()<const std::basic_string<char, std::char_traits<char>, std::allocator<char> > > (__functor=<value optimized out>, __args#0="Failed to perform fetch within 5mins") at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1191
> #10 std::tr1::_Function_handler<void(const std::string&), std::tr1::_Bind<void (*(const char*, std::tr1::_Placeholder<1>))(const std::string&, const std::string&)> >::_M_invoke(const std::tr1::_Any_data &, const std::string &) (__functor=<value optimized out>, __args#0="Failed to recover registrar: Failed to perform fetch within 5mins") at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1668
> #11 0x00007f7a44caff3c in process::Future<Nothing>::fail (this=0x7f7a140164f8, _message=<value optimized out>) at ../../3rdparty/libprocess/include/process/future.hpp:1628
> #12 0x00007f7a44de1a6a in fail (promise=std::tr1::shared_ptr (count 1) 0x7f7a140164f0, f=..., future=<value optimized out>) at ../../3rdparty/libprocess/include/process/future.hpp:789
> #13 process::internal::thenf<mesos::internal::Registry, Nothing>(const std::tr1::shared_ptr<process::Promise<Nothing> > &, const std::tr1::function<process::Future<Nothing>(const mesos::internal::Registry&)> &, const process::Future<mesos::internal::Registry> &) (promise=std::tr1::shared_ptr (count 1) 0x7f7a140164f0, f=..., future=<value optimized out>) at ../../3rdparty/libprocess/include/process/future.hpp:1438
> #14 0x00007f7a44e18ffc in process::Future<mesos::internal::Registry>::fail (this=0x7f7a2800be68, _message=<value optimized out>) at ../../3rdparty/libprocess/include/process/future.hpp:1634
> #15 0x00007f7a44e18f9c in process::Future<mesos::internal::Registry>::fail (this=0x7f7a2801c488, _message=<value optimized out>) at ../../3rdparty/libprocess/include/process/future.hpp:1628
> #16 0x00007f7a44e0cf4c in fail (this=0x2179b80, info=<value optimized out>, recovery=<value optimized out>) at ../../3rdparty/libprocess/include/process/future.hpp:789
> #17 mesos::internal::master::RegistrarProcess::_recover (this=0x2179b80, info=<value optimized out>, recovery=<value optimized out>) at ../../src/master/registrar.cpp:341
> #18 0x00007f7a44e24181 in __call<process::ProcessBase*&, 0, 1> (__functor=<value optimized out>, __args#0=<value optimized out>) at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1137
> #19 operator()<process::ProcessBase*> (__functor=<value optimized out>, __args#0=<value optimized out>) at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1191
> #20 std::tr1::_Function_handler<void(process::ProcessBase*), std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void(mesos::internal::master::RegistrarProcess*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void(mesos::internal::master::RegistrarProcess*)> >)> >::_M_invoke(const std::tr1::_Any_data &, process::ProcessBase *) (__functor=<value optimized out>, __args#0=<value optimized out>) at /usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1668
> #21 0x00007f7a452814f4 in process::ProcessManager::resume (this=0x214b690, process=0x2179e28) at ../../../3rdparty/libprocess/src/process.cpp:2848
> #22 0x00007f7a45281dec in process::schedule (arg=<value optimized out>) at ../../../3rdparty/libprocess/src/process.cpp:1479
> #23 0x0000003d63a07851 in start_thread () from /lib64/libpthread.so.0
> #24 0x0000003d636e811d in clone () from /lib64/libc.so.6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
