Subject: Re: Can't start master properly (stale state issue?); help!
From: Paul Bell
To: user@mesos.apache.org
Date: Fri, 14 Aug 2015 08:53:25 -0400

All,

By way of some background: I'm not running a data center (or centers). Rather, I work on a distributed application whose trajectory is taking it into a realm of many Docker containers distributed across many hosts (mostly virtual hosts at the outset). An environment that supports isolation, multi-tenancy, scalability, and some fault tolerance is desirable for this application. Also, the mere ability to simplify - at least somewhat - the management of multiple hosts is of great importance. So, that's more or less how I got to Mesos and to here...
I ended up writing a Java program that configures a collection of host VMs as a Mesos cluster and then, via Marathon, distributes the application containers across the cluster. Configuring & building the cluster is largely a lot of SSH work. Doing the same for the application is part Marathon, part Docker remote API. The containers that need to talk to each other via TCP are connected with Weave's (http://weave.works) overlay network. So the main infrastructure consists of Mesos, Docker, and Weave. The whole thing is pretty amazing - for which I take very little credit. Rather, these are some wonderful technologies, and the folks who write & support them are very helpful. That said, I sometimes feel like I'm juggling chain saws!

*In re* the issues raised on this thread:

All Mesos components were installed via the Mesosphere packages. The 4 VMs in the cluster are all running Ubuntu 14.04 LTS.

My suspicions about the IP address 127.0.1.1 were raised a few months ago when, after seeing this IP in a mesos-master log when things "weren't working", I discovered these articles:

https://groups.google.com/forum/#!topic/marathon-framework/1qboeZTOLU4
http://frankhinek.com/build-mesos-multi-node-ha-cluster/ (see "note 2")

So, to the point raised just now by Klaus (and earlier in the thread), the aforementioned configuration program does change /etc/hosts (and /etc/hostname) in the way Klaus suggested. But, as I mentioned to Marco & hasodent, I might have encountered a race condition wherein ZK & mesos-master saw the unchanged /etc/hosts before I altered it. I believe that I fixed that issue yesterday.

Also, as part of the "cluster create" step, I get a bit aggressive (perhaps unwisely) with what I believe are some state repositories. Specifically, I

rm /var/lib/zookeeper/version-2/*
rm -Rf /var/lib/mesos/replicated_log

Should I NOT be doing this? I know from experience that zapping the "version-2" directory (ZK's dataDir, IIRC) can solve occasional weirdness.
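A minimal check along these lines can confirm that the suspect 127.0.1.1 mapping is gone before ZK and mesos-master start; this is just a sketch - the temp file and hostname are stand-ins for a node's real /etc/hosts:

```shell
# Sketch: detect the problematic 127.0.1.1 hostname mapping before starting
# ZK / mesos-master. The temp file is a stand-in for a node's /etc/hosts,
# and "mesos-node-1" is a hypothetical hostname.
HOSTS_FILE=$(mktemp)
cat > "$HOSTS_FILE" <<'EOF'
127.0.0.1   localhost
127.0.1.1   mesos-node-1
EOF
if grep -q '^127\.0\.1\.1' "$HOSTS_FILE"; then
  echo "found 127.0.1.1 mapping - replace it with the node's real IP"
fi
```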
Marco, is "/var/lib/mesos/replicated_log" what you are referring to when you say some "issue with the log-replica"?

Just a day or two ago I first heard the term "znode" & learned a little about zkCli.sh. I will experiment with it more in the coming days.

As matters now stand, I have the cluster up and running. But before I again deploy the application, I am trying to put the cluster through its paces by periodically cycling it through the states my program can bring about, e.g.,

--cluster create     (takes a clean VM and configures it to act as one or more Mesos components: ZK, master, slave)
--cluster stop       (stops the Mesos services on each node)
--cluster destroy    (configures the VM back to its original clean state)
--cluster create
--cluster stop
--cluster start

et cetera.

*The only way I got rid of the "no leading master" issue that started this thread was by wiping out the master VM and starting over with a clean VM. That is, stopping/destroying/creating (even rebooting) the cluster had no effect.*

I suspect that, sooner or later, I will again hit this problem (probably sooner!). And I want to understand how best to handle it. Such an occurrence could be pretty awkward at a customer site.

Thanks for all your help.

Cordially,

Paul

On Thu, Aug 13, 2015 at 9:41 PM, Klaus Ma wrote:
> I used to meet a similar issue with ZooKeeper + Mesos; I resolved it by
> removing 127.0.1.1 from /etc/hosts; here is an example:
> klaus@klaus-OptiPlex-780:~/Workspace/mesos$ cat /etc/hosts
> 127.0.0.1   localhost
> 127.0.1.1   klaus-OptiPlex-780   *<<===== remove this line, and add a new
> line mapping the real IP (e.g. 192.168.1.100) to the hostname*
> ...
>
> BTW, please also clean up the log directory and restart ZK & Mesos.
>
> If you have any more comments, please let me know.
>
> Regards,
> ----
> Klaus Ma (马达), PMP® | http://www.cguru.net
>
> ------------------------------
> Date: Thu, 13 Aug 2015 12:20:34 -0700
> Subject: Re: Can't start master properly (stale state issue?); help!
> From: marco@mesosphere.io
> To: user@mesos.apache.org
>
>
>
> On Thu, Aug 13, 2015 at 11:53 AM, Paul Bell wrote:
>
> Marco & hasodent,
>
> This is just a quick note to say thank you for your replies.
>
> No problem, you're welcome.
>
>
> I will answer you much more fully tomorrow, but for now can only manage a
> few quick observations & questions:
>
> 1. Having some months ago encountered a known problem with the IP address
> 127.0.1.1 (I'll provide references tomorrow), I early on configured
> /etc/hosts, replacing "myHostName 127.0.1.1" with "myHostName <Real_IP>".
> That said, I can't rule out a race condition whereby ZK | mesos-master saw
> the original unchanged /etc/hosts before I zapped it.
>
> 2. What is a znode and how would I drop it?
>
> so, the znode is the fancy name that ZK gives to the nodes in its tree
> (trivially, the "path") - assuming that you give Mesos the following ZK URL:
>
> zk://10.10.0.5:2181/mesos/prod
>
> the 'znode' would be `/mesos/prod` and you could go inspect it (using
> zkCli.sh) by doing:
>
> ls /mesos/prod
>
> you should see at least one (with the Master running) file: info_0000001
> or json.info_00000001 (depending on whether you're running 0.23 or 0.24)
> and you could then inspect its contents with:
>
> get /mesos/prod/info_0000001
>
> For example, if I run a Mesos 0.23 on my localhost, against ZK on the same:
>
> $ ./bin/mesos-master.sh --zk=zk://localhost:2181/mesos/test --quorum=1
> --work_dir=/tmp/m23-2 --port=5053
>
> I can connect to ZK via zkCli.sh and:
>
> [zk: localhost:2181(CONNECTED) 4] ls /mesos/test
> [info_0000000006, log_replicas]
> [zk: localhost:2181(CONNECTED) 6] get /mesos/test/info_0000000006
> #20150813-120952-18983104-5053-14072ц '"master@192.168.33.1:5053
> *.... 192.168.33.120.23.0
>
> cZxid = 0x314
> dataLength = 93
> .... // a bunch of other metadata
> numChildren = 0
>
> (you can remove it with - you guessed it - `rm -f /mesos/test` at the CLI
> prompt - stop Mesos first, or it will be a very unhappy Master :).
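[As a small aside on the ZK URL above: the path component after host:port is exactly the znode path. A minimal shell sketch, using the example URL from this thread:]

```shell
# Sketch: derive the znode path from a Mesos ZK URL (zk://host:port/path).
# The URL below is just the example from this thread.
ZK_URL="zk://10.10.0.5:2181/mesos/prod"
ZNODE="/${ZK_URL#zk://*/}"   # strip the scheme and host:port, keep the path
echo "$ZNODE"                # prints /mesos/prod
```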
> in the corresponding logs I see (note the "new leader" here too, even
> though this was the one and only):
>
> I0813 12:09:52.126509 105455616 group.cpp:656] Trying to get
> '/mesos/test/info_0000000006' in ZooKeeper
> W0813 12:09:52.127071 107065344 detector.cpp:444] Leading master
> master@192.168.33.1:5053 is using a Protobuf
> binary format when registering with ZooKeeper (info): this will be
> deprecated as of Mesos 0.24 (see MESOS-2340)
> I0813 12:09:52.127094 107065344 detector.cpp:481] A new leading master
> (UPID=master@192.168.33.1:5053) is detected
> I0813 12:09:52.127187 103845888 master.cpp:1481] The newly elected leader
> is master@192.168.33.1:5053 with id
> 20150813-120952-18983104-5053-14072
> I0813 12:09:52.127209 103845888 master.cpp:1494] Elected as the leading
> master!
>
>
> At this point, I'm almost sure you're running up against some issue with
> the log-replica; but I'm the least competent guy here to help you on that
> one; hopefully someone else will be able to add insight here.
>
> I start the services (zk, master, marathon; all on same host) by SSHing
> into the host & doing "service XXXX start" commands.
>
> Again, thanks very much; and more tomorrow.
>
> Cordially,
>
> Paul
>
> On Thu, Aug 13, 2015 at 1:08 PM, haosdent wrote:
>
> Hello, how do you start the master? And could you try using "netstat -antp | grep
> 5050" to find out whether there are multiple master processes running on the
> same machine?
>
> On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell wrote:
>
> Hi All,
>
> I hope someone can shed some light on this because I'm getting desperate!
>
> I try to start components zk, mesos-master, and marathon in that order.
> They are started via a program that SSHs to the sole host and does "service
> xxx start". Everyone starts happily enough. But the Mesos UI shows me:
>
> *This master is not the leader, redirecting in 0 seconds ...
go now*
>
> The pattern seen in all of the mesos-master.INFO logs (one of which is shown
> below) is that the mesos-master with the correct IP address starts. But then a
> "new leader" is detected and becomes leading master. This new leader shows
> *(UPID=master@127.0.1.1:5050)*
>
> I've tried clearing what ZK and mesos-master state I can find, but this
> problem will not "go away".
>
> Would someone be so kind as to a) explain what is happening here and b)
> suggest remedies?
>
> Thanks very much.
>
> -Paul
>
>
> Log file created at: 2015/08/13 10:19:43
> Running on machine: 71.100.14.9
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> I0813 10:19:43.225636  2542 logging.cpp:172] INFO level logging started!
> I0813 10:19:43.235213  2542 main.cpp:181] Build: 2015-05-05 06:15:50 by
> root
> I0813 10:19:43.235244  2542 main.cpp:183] Version: 0.22.1
> I0813 10:19:43.235257  2542 main.cpp:186] Git tag: 0.22.1
> I0813 10:19:43.235268  2542 main.cpp:190] Git SHA:
> d6309f92a7f9af3ab61a878403e3d9c284ea87e0
> I0813 10:19:43.245098  2542 leveldb.cpp:176] Opened db in 9.386828ms
> I0813 10:19:43.247138  2542 leveldb.cpp:183] Compacted db in 1.956669ms
> I0813 10:19:43.247194  2542 leveldb.cpp:198] Created db iterator in 13961ns
> I0813 10:19:43.247206  2542 leveldb.cpp:204] Seeked to beginning of db in
> 677ns
> I0813 10:19:43.247215  2542 leveldb.cpp:273] Iterated through 0 keys in
> the db in 243ns
> I0813 10:19:43.247252  2542 replica.cpp:744] Replica recovered with log
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0813 10:19:43.248755  2611 log.cpp:238] Attempting to join replica to
> ZooKeeper group
> I0813 10:19:43.248924  2542 main.cpp:306] Starting Mesos master
> I0813 10:19:43.249244  2612 recover.cpp:449] Starting replica recovery
> I0813 10:19:43.250239  2612 recover.cpp:475] Replica is in EMPTY status
> I0813 10:19:43.250819  2612 replica.cpp:641] Replica in EMPTY status
> received a broadcasted recover request
> I0813 10:19:43.251014  2607
recover.cpp:195] Received a recover response
> from a replica in EMPTY status
> *I0813 10:19:43.249503  2542 master.cpp:349] Master
> 20150813-101943-151938119-5050-2542 (71.100.14.9) started on
> 71.100.14.9:5050*
> I0813 10:19:43.252053  2610 recover.cpp:566] Updating replica status to
> STARTING
> I0813 10:19:43.252571  2542 master.cpp:397] Master allowing
> unauthenticated frameworks to register
> I0813 10:19:43.253159  2542 master.cpp:402] Master allowing
> unauthenticated slaves to register
> I0813 10:19:43.254276  2612 leveldb.cpp:306] Persisting metadata (8 bytes)
> to leveldb took 1.816161ms
> I0813 10:19:43.254323  2612 replica.cpp:323] Persisted replica status to
> STARTING
> I0813 10:19:43.254905  2612 recover.cpp:475] Replica is in STARTING status
> I0813 10:19:43.255203  2612 replica.cpp:641] Replica in STARTING status
> received a broadcasted recover request
> I0813 10:19:43.255265  2612 recover.cpp:195] Received a recover response
> from a replica in STARTING status
> I0813 10:19:43.255343  2612 recover.cpp:566] Updating replica status to
> VOTING
> I0813 10:19:43.258730  2611 master.cpp:1295] Successfully attached file
> '/var/log/mesos/mesos-master.INFO'
> I0813 10:19:43.258760  2609 contender.cpp:131] Joining the ZK group
> I0813 10:19:43.258862  2612 leveldb.cpp:306] Persisting metadata (8 bytes)
> to leveldb took 3.477458ms
> I0813 10:19:43.258894  2612 replica.cpp:323] Persisted replica status to
> VOTING
> I0813 10:19:43.258934  2612 recover.cpp:580] Successfully joined the Paxos
> group
> I0813 10:19:43.258987  2612 recover.cpp:464] Recover process terminated
> I0813 10:19:46.590340  2606 group.cpp:313] Group process (group(1)@
> 71.100.14.9:5050) connected to ZooKeeper
> I0813 10:19:46.590373  2606 group.cpp:790] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I0813 10:19:46.590386  2606 group.cpp:385] Trying to create path
> '/mesos/log_replicas' in ZooKeeper
> I0813 10:19:46.591442  2606 network.hpp:424] ZooKeeper
group memberships
> changed
> I0813 10:19:46.591514  2606 group.cpp:659] Trying to get
> '/mesos/log_replicas/0000000000' in ZooKeeper
> I0813 10:19:46.592146  2606 group.cpp:659] Trying to get
> '/mesos/log_replicas/0000000001' in ZooKeeper
> I0813 10:19:46.593128  2608 network.hpp:466] ZooKeeper group PIDs: {
> log-replica(1)@127.0.1.1:5050 }
> I0813 10:19:46.593955  2608 group.cpp:313] Group process (group(2)@
> 71.100.14.9:5050) connected to ZooKeeper
> I0813 10:19:46.593977  2608 group.cpp:790] Syncing group operations: queue
> size (joins, cancels, datas) = (1, 0, 0)
> I0813 10:19:46.593986  2608 group.cpp:385] Trying to create path
> '/mesos/log_replicas' in ZooKeeper
> I0813 10:19:46.594894  2605 group.cpp:313] Group process (group(3)@
> 71.100.14.9:5050) connected to ZooKeeper
> I0813 10:19:46.594992  2605 group.cpp:790] Syncing group operations: queue
> size (joins, cancels, datas) = (1, 0, 0)
> I0813 10:19:46.595007  2605 group.cpp:385] Trying to create path '/mesos'
> in ZooKeeper
> I0813 10:19:46.595654  2607 group.cpp:313] Group process (group(4)@
> 71.100.14.9:5050) connected to ZooKeeper
> I0813 10:19:46.595741  2607 group.cpp:790] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I0813 10:19:46.595785  2607 group.cpp:385] Trying to create path '/mesos'
> in ZooKeeper
> I0813 10:19:46.598635  2612 network.hpp:424] ZooKeeper group memberships
> changed
> I0813 10:19:46.598775  2612 group.cpp:659] Trying to get
> '/mesos/log_replicas/0000000000' in ZooKeeper
> I0813 10:19:46.599954  2612 group.cpp:659] Trying to get
> '/mesos/log_replicas/0000000001' in ZooKeeper
> I0813 10:19:46.600307  2611 contender.cpp:247] New candidate (id='3') has
> entered the contest for leadership
> I0813 10:19:46.600721  2612 group.cpp:659] Trying to get
> '/mesos/log_replicas/0000000002' in ZooKeeper
> I0813 10:19:46.601297  2612 network.hpp:466] ZooKeeper group PIDs: {
> log-replica(1)@127.0.1.1:5050, log-replica(1)@71.100.14.9:5050 }
> I0813
10:19:46.601752  2607 detector.cpp:138] Detected a new leader:
> (id='0')
> I0813 10:19:46.601850  2611 group.cpp:659] Trying to get
> '/mesos/info_0000000000' in ZooKeeper
> *I0813 10:19:46.602330  2611 detector.cpp:452] A new leading master
> (UPID=master@127.0.1.1:5050) is detected*
> *I0813 10:19:46.602412  2607 master.cpp:1356] The newly elected leader is
> master@127.0.1.1:5050 with id
> 20150813-101601-16842879-5050-6368*
> I0813 10:19:58.542353  2611 http.cpp:516] HTTP request for
> '/master/state.json'
> I0813 10:19:59.457691  2612 http.cpp:516] HTTP request for
> '/master/state.json'
> I0813 10:20:00.355845  2606 http.cpp:516] HTTP request for
> '/master/state.json'
> I0813 10:20:06.577448  2609 http.cpp:352] HTTP request for
> '/master/redirect'
>
>
>
>
> --
> Best Regards,
> Haosdent Huang
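[The telltale symptom in the log above - a log-replica registered at 127.0.1.1 alongside the real one - can be spotted mechanically with a grep. A sketch, in which the heredoc stands in for a real /var/log/mesos/mesos-master.INFO:]

```shell
# Sketch: scan a mesos-master log for replicas registered at the loopback
# address 127.0.1.1 - the symptom discussed in this thread. The heredoc
# stands in for a real /var/log/mesos/mesos-master.INFO.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
I0813 10:19:46.601297  2612 network.hpp:466] ZooKeeper group PIDs: { log-replica(1)@127.0.1.1:5050, log-replica(1)@71.100.14.9:5050 }
EOF
BAD=$(grep -o 'log-replica([0-9]*)@127\.0\.1\.1:[0-9]*' "$LOG")
echo "$BAD"   # prints log-replica(1)@127.0.1.1:5050
```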