mesos-dev mailing list archives

From Chengwei Yang <chengwei.yang...@gmail.com>
Subject Re: Mesos HA does not work (Failed to recover registrar)
Date Tue, 07 Jun 2016 01:15:51 GMT
@Qian,

I think you're running into a firewall issue. Did you make sure your masters
can reach each other?

From master A:
$ telnet B 5050

I suspect that connection will fail.

Please make sure any firewall is shut down.

-- 
Thanks,
Chengwei

On Mon, Jun 06, 2016 at 09:06:43PM +0800, Qian Zhang wrote:
> I deleted everything in the work dir (/var/lib/mesos/master), and tried again,
> the same error still happened :-(
> 
> 
> Thanks,
> Qian Zhang
> 
> On Mon, Jun 6, 2016 at 3:03 AM, Jean Christophe “JC” Martin <
> jch.martin@gmail.com> wrote:
> 
>     Qian,
> 
>     Zookeeper should be able to reach a quorum with 2, no need to start 3
>     simultaneously, but there is an issue with Zookeeper related to connection
>     timeouts.
>     https://issues.apache.org/jira/browse/ZOOKEEPER-2164
>     In some circumstances, this timeout is higher than the sync timeout, which
>     causes the leader election to fail.
>     Try setting the cnxTimeout parameter in ZooKeeper (by default it’s 5000ms)
>     to 500 (500ms). After doing this, leader election in ZK will be
>     very fast even if a node is disconnected.
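>     A sketch of that change (on ZooKeeper 3.4.x this value is read as the
>     Java system property zookeeper.cnxTimeout rather than a zoo.cfg key, so
>     one way to set it is via SERVER_JVMFLAGS in conf/zookeeper-env.sh):

```shell
# conf/zookeeper-env.sh -- 500 ms quorum connection timeout
# (default is 5000 ms; the property is read during leader election)
SERVER_JVMFLAGS="-Dzookeeper.cnxTimeout=500"
```

>     Restart each ZooKeeper server after the change.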
>    
>     JC
>    
>     > On Jun 4, 2016, at 4:34 PM, Qian Zhang <zhq527725@gmail.com> wrote:
>     >
>     > Thanks Vinod and Dick.
>     >
>     > I think my 3 ZK servers have formed a quorum, each of them has the
>     > following config:
>     >    $ cat conf/zoo.cfg
>     >    server.1=192.168.122.132:2888:3888
>     >    server.2=192.168.122.225:2888:3888
>     >    server.3=192.168.122.171:2888:3888
>     >    autopurge.purgeInterval=6
>     >    autopurge.snapRetainCount=5
>     >    initLimit=10
>     >    syncLimit=5
>     >    maxClientCnxns=0
>     >    clientPort=2181
>     >    tickTime=2000
>     >    quorumListenOnAllIPs=true
>     >    dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
>     >    dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
>     >
>     > And when I run "bin/zkServer.sh status" on each of them, I can see "Mode:
>     > leader" for one, and "Mode: follower" for the other two.
>     >
>     > I have already tried to manually start 3 masters simultaneously, and here
>     > is what I see in their log:
>     > In 192.168.122.171(this is the first master I started):
>     >    I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader:
>     > (id='25')
>     >    I0605 07:12:49.419276  1186 group.cpp:698] Trying to get
>     > '/mesos/log_replicas/0000000024' in ZooKeeper
>     >    I0605 07:12:49.420013  1188 group.cpp:698] Trying to get
>     > '/mesos/json.info_0000000025' in ZooKeeper
>     >    I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master
>     > (UPID=master@192.168.122.171:5050) is detected
>     >    I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: {
>     > log-replica(1)@192.168.122.171:5050 }
>     >    I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected leader
>     > is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>     >    I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading
>     > master!
>     >
>     > In 192.168.122.225 (second master I started):
>     >    I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader:
>     > (id='25')
>     >    I0605 07:12:51.919983  2246 group.cpp:698] Trying to get
>     > '/mesos/json.info_0000000025' in ZooKeeper
>     >    I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: {
>     > log-replica(1)@192.168.122.171:5050 }
>     >    I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status
>     > received a broadcasted recover request from (6)@192.168.122.225:5050
>     >    I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master
>     > (UPID=master@192.168.122.171:5050) is detected
>     >    I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected leader
>     > is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
>     >
>     > In 192.168.122.132 (last master I started):
>     > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader:
>     > (id='25')
>     > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get
>     > '/mesos/json.info_0000000025' in ZooKeeper
>     > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master
>     > (UPID=master@192.168.122.171:5050) is detected
>     >
>     > So right after I started these 3 masters, the first one (192.168.122.171)
>     > was successfully elected as leader, but after 60s, 192.168.122.171 failed
>     > with the error mentioned in my first mail, and then 192.168.122.225 was
>     > elected as leader, but it failed with the same error too after another 60s,
>     > and the same thing happened to the last one (192.168.122.132). So after
>     > about 180s, all 3 of my masters were down.
>     >
>     > I tried both:
>     >    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
>     > --work_dir=/var/lib/mesos/master
>     > and
>     >    sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,
>     > 192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2
>     > --work_dir=/var/lib/mesos/master
>     > And I see the same error for both.
>     >
>     > 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which are
>     > running on a KVM hypervisor host.
>     >
>     >
>     >
>     >
>     > Thanks,
>     > Qian Zhang
>     >
>     > On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <dick@hellooperator.net> wrote:
>     >
>     >> You told the master it needed a quorum of 2 and it's the only one
>     >> online, so it's bombing out.
>     >> That's the expected behaviour.
>     >>
>     >> You need to start at least 2 zookeepers before it will be a functional
>     >> group, same for the masters.
>     >>
>     >> You haven't mentioned how you set up your zookeeper cluster, so I'm
>     >> assuming it's working correctly (3 nodes, all aware of the other 2 in
>     >> their config). If not, you need to sort that out first.
>     >>
>     >>
>     >> Also I think your zk URL is wrong - you want to list all 3 zookeeper
>     >> nodes like this:
>     >>
>     >> sudo ./bin/mesos-master.sh
>     >> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
>     >> --work_dir=/var/lib/mesos/master
>     >>
>     >> Once you've run that command on 2 hosts, things should start working;
>     >> you'll want all 3 up for redundancy.
>     >>
>     >> On 4 June 2016 at 16:42, Qian Zhang <zhq527725@gmail.com> wrote:
>     >>> Hi Folks,
>     >>>
>     >>> I am trying to set up a Mesos HA env with 3 nodes; each node has a
>     >>> Zookeeper running, so they form a Zookeeper cluster. And then when I
>     >>> started the first Mesos master on one node with:
>     >>>    sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
>     >>> --work_dir=/var/lib/mesos/master
>     >>>
>     >>> I found it will hang here for 60 seconds:
>     >>>  I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
>     >>> (UPID=master@192.168.122.132:5050) is detected
>     >>>  I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader is
>     >>> master@192.168.122.132:5050 with id 40d387a6-4d61-49d6-af44-51dd41457390
>     >>>  I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading
>     >>> master!
>     >>>  I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
>     >>>  I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
>     >>>  I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the writer
>     >>>
>     >>> And after 60s, master will fail:
>     >>> F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to
>     >>> recover registrar: Failed to perform fetch within 1mins
>     >>> *** Check failure stack trace: ***
>     >>>    @     0x7f4b81372f4e  google::LogMessage::Fail()
>     >>>    @     0x7f4b81372e9a  google::LogMessage::SendToLog()
>     >>>    @     0x7f4b8137289c  google::LogMessage::Flush()
>     >>>    @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
>     >>>    @     0x7f4b8040eea0  mesos::internal::master::fail()
>     >>>    @     0x7f4b804dbeb3
>     >>>
>     >>
>     _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>     >>>    @     0x7f4b804ba453
>     >>> _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>     >>>    @     0x7f4b804898d7
>     >>>
>     >>
>     _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>     >>>    @     0x7f4b804dbf80
>     >>>
>     >>
>     _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>     >>>    @           0x49d257  std::function<>::operator()()
>     >>>    @           0x49837f
>     >>>
>     >>
>     _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>     >>>    @           0x493024  process::Future<>::fail()
>     >>>    @     0x7f4b8015ad20  process::Promise<>::fail()
>     >>>    @     0x7f4b804d9295  process::internal::thenf<>()
>     >>>    @     0x7f4b8051788f
>     >>>
>     >>
>     _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>     >>>    @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
>     >>>    @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
>     >>>    @     0x7f4b8050fc69  std::function<>::operator()()
>     >>>    @     0x7f4b804f9609
>     >>>
>     >>
>     _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>     >>>    @     0x7f4b80517936
>     >>>
>     >>
>     _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>     >>>    @     0x7f4b8050fc69  std::function<>::operator()()
>     >>>    @     0x7f4b8056b1b4  process::internal::run<>()
>     >>>    @     0x7f4b80561672  process::Future<>::fail()
>     >>>    @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
>     >>>    @     0x7f4b8059757f
>     >>>
>     >>
>     _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>     >>>    @     0x7f4b8058fad1
>     >>>
>     >>
>     _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>     >>>    @     0x7f4b80585a41
>     >>>
>     >>
>     _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>     >>>    @     0x7f4b80597605
>     >>>
>     >>
>     _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>     >>>    @           0x49d257  std::function<>::operator()()
>     >>>    @           0x49837f
>     >>>
>     >>
>     _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>     >>>    @     0x7f4b8056164a  process::Future<>::fail()
>     >>>    @     0x7f4b8055a378  process::Promise<>::fail()
>     >>>
>     >>> I tried both Zookeeper 3.4.8 and 3.4.6 with the latest code of Mesos,
>     >>> but no luck with either. Any ideas about what happened? Thanks.
>     >>>
>     >>>
>     >>>
>     >>> Thanks,
>     >>> Qian Zhang
>     >>
> 
> 
> 
