Date: Tue, 12 Apr 2016 17:23:25 +0000 (UTC)
From: "Priyanka Gupta (JIRA)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

    [ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237577#comment-15237577 ]

Priyanka Gupta commented on MESOS-5193:
---------------------------------------

Error Stack in mesos master log

Node 3

I0411 22:47:02.007249 1348 detector.cpp:479] A new leading master (UPID=master@10.221.28.61:5050) is detected
I0411 22:47:02.007380 1348 master.cpp:1710] The newly elected leader is master@10.221.28.61:5050 with id 725d1232-bea3-4df5-90c5-6479e5652ef4
I0411 22:47:02.007428 1348 master.cpp:1723] Elected as the leading master!
I0411 22:47:02.007457 1348 master.cpp:1468] Recovering from registrar
I0411 22:47:02.007551 1345 registrar.cpp:307] Recovering registrar
I0411 22:47:02.007649 1356 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050 }
I0411 22:47:02.007841 1356 log.cpp:659] Attempting to start the writer
I0411 22:47:02.008477 1348 replica.cpp:493] Replica received implicit promise request from (30)@10.221.28.61:5050 with proposal 52
E0411 22:47:02.008903 1358 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:47:02.009968 1348 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.44126ms
I0411 22:47:02.010022 1348 replica.cpp:342] Persisted promised to 52
F0411 22:48:02.008332 1357 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @ 0x7f4bd5bcedfd (unknown)
    @ 0x7f4bd5bd0c3d (unknown)
    @ 0x7f4bd5bce9ec (unknown)
    @ 0x7f4bd5bd1539 (unknown)
    @ 0x7f4bd54022dc (unknown)
    @ 0x7f4bd5442ab0 (unknown)
    @ 0x42807e (unknown)
    @ 0x7f4bd54690a5 (unknown)
    @ 0x7f4bd54bb976 (unknown)
    @ 0x7f4bd54cc566 (unknown)
    @ 0x7f4bd52fc4d6 (unknown)
    @ 0x7f4bd54cc553 (unknown)
    @ 0x7f4bd54b0614 (unknown)
    @ 0x7f4bd5b7c971 (unknown)
    @ 0x7f4bd5b7cc77 (unknown)
    @ 0x3dc38b6470 (unknown)
    @ 0x3dc18079d1 (unknown)
    @ 0x3dc14e88fd (unknown)
    @ (nil) (unknown)
/bin/bash: line 1: 1313 Aborted /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2

Node 2

I0411 22:48:10.006216 1466 log.cpp:659] Attempting to start the writer
E0411 22:48:10.006958 1478 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:48:10.007202 1467 replica.cpp:493] Replica received implicit promise request from (13)@10.221.28.249:5050 with proposal 52
E0411 22:48:10.007491 1478 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:48:10.008458 1467 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.227092ms
I0411 22:48:10.008491 1467 replica.cpp:342] Persisted promised to 52
F0411 22:49:10.006739 1476 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @ 0x7fec686f2dfd (unknown)
    @ 0x7fec686f4c3d (unknown)
    @ 0x7fec686f29ec (unknown)
    @ 0x7fec686f5539 (unknown)
    @ 0x7fec67f262dc (unknown)
    @ 0x7fec67f66ab0 (unknown)
    @ 0x42807e (unknown)
    @ 0x7fec67f8d0a5 (unknown)
    @ 0x7fec67fdf976 (unknown)
    @ 0x7fec67ff0566 (unknown)
    @ 0x7fec67e204d6 (unknown)
    @ 0x7fec67ff0553 (unknown)
    @ 0x7fec67fd4614 (unknown)
    @ 0x7fec686a0971 (unknown)
    @ 0x7fec686a0c77 (unknown)
    @ 0x37f98b6470 (unknown)
    @ 0x39ed207a51 (unknown)
    @ 0x39ecae89ad (unknown)
    @ (nil) (unknown)
/bin/bash: line 1: 1452 Aborted /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2

Node 1

I0411 22:45:52.017833 8338 detector.cpp:479] A new leading master (UPID=master@10.221.29.247:5050) is detected
I0411 22:45:52.017925 8338 master.cpp:1710] The newly elected leader is master@10.221.29.247:5050 with id 13df6437-fbe9-4390-9f6c-db9fd1d53a16
I0411 22:45:52.017956 8338 master.cpp:1723] Elected as the leading master!
I0411 22:45:52.017983 8338 master.cpp:1468] Recovering from registrar
I0411 22:45:52.018069 8339 registrar.cpp:307] Recovering registrar
I0411 22:45:52.018337 8333 log.cpp:659] Attempting to start the writer
I0411 22:45:52.018785 8336 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.29.247:5050 }
I0411 22:45:52.019008 8336 replica.cpp:493] Replica received implicit promise request from (31)@10.221.29.247:5050 with proposal 50
E0411 22:45:52.019548 8341 process.cpp:1966] Failed to shutdown socket with fd 24: Transport endpoint is not connected
I0411 22:45:52.020465 8336 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.421142ms
I0411 22:45:52.020496 8336 replica.cpp:342] Persisted promised to 50
I0411 22:46:15.034744 8340 network.hpp:413] ZooKeeper group memberships changed
I0411 22:46:15.034867 8334 group.cpp:672] Trying to get '/mesos/log_replicas/0000000346' in ZooKeeper
I0411 22:46:15.035729 8334 group.cpp:672] Trying to get '/mesos/log_replicas/0000000347' in ZooKeeper
I0411 22:46:15.036533 8334 group.cpp:672] Trying to get '/mesos/log_replicas/0000000348' in ZooKeeper
I0411 22:46:15.037353 8335 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050, log-replica(1)@10.221.29.247:5050 }
I0411 22:46:27.242632 8336 http.cpp:503] HTTP GET for /master/state.json from 216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
I0411 22:46:37.292083 8335 http.cpp:503] HTTP GET for /master/state.json from 216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
I0411 22:46:47.342876 8334 http.cpp:503] HTTP GET for /master/state.json from 216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
F0411 22:46:52.019045 8333 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @ 0x7f7ad44badfd (unknown)
    @ 0x7f7ad44bcc3d (unknown)
    @ 0x7f7ad44ba9ec (unknown)
    @ 0x7f7ad44bd539 (unknown)
    @ 0x7f7ad3cee2dc (unknown)
    @ 0x7f7ad3d2eab0 (unknown)
    @ 0x42807e (unknown)
    @ 0x7f7ad3d550a5 (unknown)
    @ 0x7f7ad3da7976 (unknown)
    @ 0x7f7ad3db8566 (unknown)
    @ 0x7f7ad3be84d6 (unknown)
    @ 0x7f7ad3db8553 (unknown)
    @ 0x7f7ad3d9c614 (unknown)
    @ 0x7f7ad4468971 (unknown)
    @ 0x7f7ad4468c77 (unknown)
    @ 0x35282b6470 (unknown)
    @ 0x35262079d1 (unknown)
    @ 0x3525ee88fd (unknown)
    @ (nil) (unknown)
/bin/bash: line 1: 8332 Aborted /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2


> Recovery failed: Failed to recover registrar on reboot of mesos master
> ----------------------------------------------------------------------
>
>                 Key: MESOS-5193
>                 URL: https://issues.apache.org/jira/browse/MESOS-5193
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.22.0, 0.27.0
>            Reporter: Priyanka Gupta
>              Labels: master, mesosphere
>
> Hi all,
> We are running a 3-node cluster with mesos master, mesos slave and zookeeper on every node, and chronos on top of it. The problem is that when we reboot the leading mesos master, the other nodes try to get elected as leader but fail during registrar recovery:
> "Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins"
> The next node then tries to become the leader but fails with the same error. I am not sure what is causing this. We are currently on mesos 0.22 and also tried upgrading to mesos 0.27, but the problem still happens.
> /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
> Can you please help us resolve this issue, as it's a production system.
> Thanks,
> Priyanka



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
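
For anyone reading the master logs quoted above, the following is a rough sketch of how one might pull two facts out of a log excerpt: how long the master waited between "Recovering from registrar" and the fatal "Recovery failed" line, and which log replicas it saw in the ZooKeeper group. It is not part of Mesos. The names parse_glog_line and summarize_recovery are hypothetical, and the year 2016 is an assumption taken from the Date header of this mail, since the log lines themselves carry only month and day.

```python
import re
from datetime import datetime

# glog line prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line]
GLOG_LINE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ ([\w.]+):\d+\] (.*)$"
)
GROUP_PIDS = re.compile(r"ZooKeeper group PIDs: \{(.*)\}")

# Three lines copied from the Node 3 excerpt in this message.
SAMPLE = """\
I0411 22:47:02.007457 1348 master.cpp:1468] Recovering from registrar
I0411 22:47:02.007649 1356 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050 }
F0411 22:48:02.008332 1357 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
"""


def parse_glog_line(line, year=2016):
    """Return (severity, timestamp, source_file, message), or None if not a glog line."""
    match = GLOG_LINE.match(line.strip())
    if match is None:
        return None
    severity, mmdd, clock, source, message = match.groups()
    # The year is an assumption: glog prints only month and day.
    timestamp = datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")
    return severity, timestamp, source, message


def summarize_recovery(lines, year=2016):
    """Report seconds from recovery start to fatal failure and the replicas seen."""
    started = failed = None
    replicas = set()
    for line in lines:
        parsed = parse_glog_line(line, year)
        if parsed is None:
            continue  # stack-trace frames, shell output, blank lines
        severity, timestamp, _source, message = parsed
        if started is None and message == "Recovering from registrar":
            started = timestamp
        if failed is None and severity == "F" and message.startswith("Recovery failed"):
            failed = timestamp
        group = GROUP_PIDS.search(message)
        if group:
            replicas.update(p.strip() for p in group.group(1).split(",") if p.strip())
    elapsed = (failed - started).total_seconds() if started and failed else None
    return {"seconds_to_failure": elapsed, "replicas_seen": sorted(replicas)}


if __name__ == "__main__":
    print(summarize_recovery(SAMPLE.splitlines()))
```

Run against the Node 3 lines, this should report a wait of roughly one minute and two replicas in the group, which matches the "within 1mins" in the fatal message. With --quorum=2, my understanding is that both of those replicas would have to respond for recovery to succeed, so it may be worth checking whether each listed replica was actually reachable at that time.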