mesos-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marco Massenzio <ma...@mesosphere.io>
Subject Re: Can't start master properly (stale state issue?); help!
Date Thu, 13 Aug 2015 19:20:34 GMT
On Thu, Aug 13, 2015 at 11:53 AM, Paul Bell <arachweb@gmail.com> wrote:

> Marco & hasodent,
>
> This is just a quick note to say thank you for your replies.
>
> No problem, you're welcome.


> I will answer you much more fully tomorrow, but for now can only manage a
> few quick observations & questions:
>
> 1. Having some months ago encountered a known problem with the IP@
> 127.0.1.1 (I'll provide references tomorrow), I early on configured
> /etc/hosts, replacing "myHostName 127.0.1.1" with "myHostName <Real_IP>".
> That said, I can't rule out a race condition whereby ZK | mesos-master saw
> the original unchanged /etc/hosts before I zapped it.
>
> 2. What is a znode and how would I drop it?
>
> so, the znode is the fancy name that ZK gives to the nodes in its tree
(trivially, the "path") - assuming that you give Mesos the following ZK URL:
zk://10.10.0.5:2181/mesos/prod

the 'znode' would be `/mesos/prod` and you could go inspect it (using
zkCli.sh) by doing:
> ls /mesos/prod

you should see at least one (with the Master running) file: info_0000001 or
json.info_00000001 (depending on whether you're running 0.23 or 0.24) and
you could then inspect its contents with:
> get /mesos/prod/info_0000001

For example, if I run a Mesos 0.23 on my localhost, against ZK on the same:

$ ./bin/mesos-master.sh --zk=zk://localhost:2181/mesos/test --quorum=1
--work_dir=/tmp/m23-2 --port=5053
I can connect to ZK via zkCli.sh and:

[zk: localhost:2181(CONNECTED) 4] ls /mesos/test
[info_0000000006, log_replicas]
[zk: localhost:2181(CONNECTED) 6] get /mesos/test/info_0000000006
#20150813-120952-18983104-5053-14072ц '"master@192.168.33.1:5053
*.... 192.168.33.120.23.0

cZxid = 0x314
dataLength = 93
.... // a bunch of other metadata
numChildren = 0

(you can remove it with - you guessed it - `rm -f /mesos/test` at the CLI
prompt - stop Mesos first, or it will be a very unhappy Master :).
in the corresponding logs I see (note the "new leader" here too, even
though this was the one and only):

I0813 12:09:52.126509 105455616 group.cpp:656] Trying to get
'/mesos/test/info_0000000006' in ZooKeeper
W0813 12:09:52.127071 107065344 detector.cpp:444] Leading master
master@192.168.33.1:5053 is using a Protobuf binary format when registering
with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see
MESOS-2340)
I0813 12:09:52.127094 107065344 detector.cpp:481] A new leading master
(UPID=master@192.168.33.1:5053) is detected
I0813 12:09:52.127187 103845888 master.cpp:1481] The newly elected leader
is master@192.168.33.1:5053 with id 20150813-120952-18983104-5053-14072
I0813 12:09:52.127209 103845888 master.cpp:1494] Elected as the leading
master!


At this point, I'm almost sure you're running up against some issue with
the log-replica; but I'm the least competent guy here to help you on that
one, hopefully someone else will be able to add insight here.

I start the services (zk, master, marathon; all on same host) by SSHing
> into the host & doing "service XXXX start" commands.
>
> Again, thanks very much; and more tomorrow.
>
> Cordially,
>
> Paul
>
> On Thu, Aug 13, 2015 at 1:08 PM, haosdent <haosdent@gmail.com> wrote:
>
>> Hello, how you start the master? And could you try use "netstat
>> -antp|grep 5050" to find whether there are multi master processes run at a
>> same machine or not?
>>
>> On Thu, Aug 13, 2015 at 10:37 PM, Paul Bell <arachweb@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I hope someone can shed some light on this because I'm getting desperate!
>>>
>>> I try to start components zk, mesos-master, and marathon in that order.
>>> They are started via a program that SSHs to the sole host and does "service
>>> xxx start". Everyone starts happily enough. But the Mesos UI shows me:
>>>
>>> *This master is not the leader, redirecting in 0 seconds ... go now*
>>>
>>> The pattern seen in all of the mesos-master.INFO logs (one of which
>>> shown below) is that the mesos-master with the correct IP@ starts. But
>>> then a "new leader" is detected and becomes leading master. This new leader
>>> shows UPID *(UPID=master@127.0.1.1:5050 <http://master@127.0.1.1:5050>*
>>>
>>> I've tried clearing what ZK and mesos-master state I can find, but this
>>> problem will not "go away".
>>>
>>> Would someone be so kind as to a) explain what is happening here and b)
>>> suggest remedies?
>>>
>>> Thanks very much.
>>>
>>> -Paul
>>>
>>>
>>> Log file created at: 2015/08/13 10:19:43
>>> Running on machine: 71.100.14.9
>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>> I0813 10:19:43.225636  2542 logging.cpp:172] INFO level logging started!
>>> I0813 10:19:43.235213  2542 main.cpp:181] Build: 2015-05-05 06:15:50 by
>>> root
>>> I0813 10:19:43.235244  2542 main.cpp:183] Version: 0.22.1
>>> I0813 10:19:43.235257  2542 main.cpp:186] Git tag: 0.22.1
>>> I0813 10:19:43.235268  2542 main.cpp:190] Git SHA:
>>> d6309f92a7f9af3ab61a878403e3d9c284ea87e0
>>> I0813 10:19:43.245098  2542 leveldb.cpp:176] Opened db in 9.386828ms
>>> I0813 10:19:43.247138  2542 leveldb.cpp:183] Compacted db in 1.956669ms
>>> I0813 10:19:43.247194  2542 leveldb.cpp:198] Created db iterator in
>>> 13961ns
>>> I0813 10:19:43.247206  2542 leveldb.cpp:204] Seeked to beginning of db
>>> in 677ns
>>> I0813 10:19:43.247215  2542 leveldb.cpp:273] Iterated through 0 keys in
>>> the db in 243ns
>>> I0813 10:19:43.247252  2542 replica.cpp:744] Replica recovered with log
>>> positions 0 -> 0 with 1 holes and 0 unlearned
>>> I0813 10:19:43.248755  2611 log.cpp:238] Attempting to join replica to
>>> ZooKeeper group
>>> I0813 10:19:43.248924  2542 main.cpp:306] Starting Mesos master
>>> I0813 10:19:43.249244  2612 recover.cpp:449] Starting replica recovery
>>> I0813 10:19:43.250239  2612 recover.cpp:475] Replica is in EMPTY status
>>> I0813 10:19:43.250819  2612 replica.cpp:641] Replica in EMPTY status
>>> received a broadcasted recover request
>>> I0813 10:19:43.251014  2607 recover.cpp:195] Received a recover response
>>> from a replica in EMPTY status
>>> *I0813 10:19:43.249503  2542 master.cpp:349] Master
>>> 20150813-101943-151938119-5050-2542 (71.100.14.9) started on
>>> 71.100.14.9:5050 <http://71.100.14.9:5050>*
>>> I0813 10:19:43.252053  2610 recover.cpp:566] Updating replica status to
>>> STARTING
>>> I0813 10:19:43.252571  2542 master.cpp:397] Master allowing
>>> unauthenticated frameworks to register
>>> I0813 10:19:43.253159  2542 master.cpp:402] Master allowing
>>> unauthenticated slaves to register
>>> I0813 10:19:43.254276  2612 leveldb.cpp:306] Persisting metadata (8
>>> bytes) to leveldb took 1.816161ms
>>> I0813 10:19:43.254323  2612 replica.cpp:323] Persisted replica status to
>>> STARTING
>>> I0813 10:19:43.254905  2612 recover.cpp:475] Replica is in STARTING
>>> status
>>> I0813 10:19:43.255203  2612 replica.cpp:641] Replica in STARTING status
>>> received a broadcasted recover request
>>> I0813 10:19:43.255265  2612 recover.cpp:195] Received a recover response
>>> from a replica in STARTING status
>>> I0813 10:19:43.255343  2612 recover.cpp:566] Updating replica status to
>>> VOTING
>>> I0813 10:19:43.258730  2611 master.cpp:1295] Successfully attached file
>>> '/var/log/mesos/mesos-master.INFO'
>>> I0813 10:19:43.258760  2609 contender.cpp:131] Joining the ZK group
>>> I0813 10:19:43.258862  2612 leveldb.cpp:306] Persisting metadata (8
>>> bytes) to leveldb took 3.477458ms
>>> I0813 10:19:43.258894  2612 replica.cpp:323] Persisted replica status to
>>> VOTING
>>> I0813 10:19:43.258934  2612 recover.cpp:580] Successfully joined the
>>> Paxos group
>>> I0813 10:19:43.258987  2612 recover.cpp:464] Recover process terminated
>>> I0813 10:19:46.590340  2606 group.cpp:313] Group process (group(1)@
>>> 71.100.14.9:5050) connected to ZooKeeper
>>> I0813 10:19:46.590373  2606 group.cpp:790] Syncing group operations:
>>> queue size (joins, cancels, datas) = (0, 0, 0)
>>> I0813 10:19:46.590386  2606 group.cpp:385] Trying to create path
>>> '/mesos/log_replicas' in ZooKeeper
>>> I0813 10:19:46.591442  2606 network.hpp:424] ZooKeeper group memberships
>>> changed
>>> I0813 10:19:46.591514  2606 group.cpp:659] Trying to get
>>> '/mesos/log_replicas/0000000000' in ZooKeeper
>>> I0813 10:19:46.592146  2606 group.cpp:659] Trying to get
>>> '/mesos/log_replicas/0000000001' in ZooKeeper
>>> I0813 10:19:46.593128  2608 network.hpp:466] ZooKeeper group PIDs: {
>>> log-replica(1)@127.0.1.1:5050 }
>>> I0813 10:19:46.593955  2608 group.cpp:313] Group process (group(2)@
>>> 71.100.14.9:5050) connected to ZooKeeper
>>> I0813 10:19:46.593977  2608 group.cpp:790] Syncing group operations:
>>> queue size (joins, cancels, datas) = (1, 0, 0)
>>> I0813 10:19:46.593986  2608 group.cpp:385] Trying to create path
>>> '/mesos/log_replicas' in ZooKeeper
>>> I0813 10:19:46.594894  2605 group.cpp:313] Group process (group(3)@
>>> 71.100.14.9:5050) connected to ZooKeeper
>>> I0813 10:19:46.594992  2605 group.cpp:790] Syncing group operations:
>>> queue size (joins, cancels, datas) = (1, 0, 0)
>>> I0813 10:19:46.595007  2605 group.cpp:385] Trying to create path
>>> '/mesos' in ZooKeeper
>>> I0813 10:19:46.595654  2607 group.cpp:313] Group process (group(4)@
>>> 71.100.14.9:5050) connected to ZooKeeper
>>> I0813 10:19:46.595741  2607 group.cpp:790] Syncing group operations:
>>> queue size (joins, cancels, datas) = (0, 0, 0)
>>> I0813 10:19:46.595785  2607 group.cpp:385] Trying to create path
>>> '/mesos' in ZooKeeper
>>> I0813 10:19:46.598635  2612 network.hpp:424] ZooKeeper group memberships
>>> changed
>>> I0813 10:19:46.598775  2612 group.cpp:659] Trying to get
>>> '/mesos/log_replicas/0000000000' in ZooKeeper
>>> I0813 10:19:46.599954  2612 group.cpp:659] Trying to get
>>> '/mesos/log_replicas/0000000001' in ZooKeeper
>>> I0813 10:19:46.600307  2611 contender.cpp:247] New candidate (id='3')
>>> has entered the contest for leadership
>>> I0813 10:19:46.600721  2612 group.cpp:659] Trying to get
>>> '/mesos/log_replicas/0000000002' in ZooKeeper
>>> I0813 10:19:46.601297  2612 network.hpp:466] ZooKeeper group PIDs: {
>>> log-replica(1)@127.0.1.1:5050, log-replica(1)@71.100.14.9:5050 }
>>> I0813 10:19:46.601752  2607 detector.cpp:138] Detected a new leader:
>>> (id='0')
>>> I0813 10:19:46.601850  2611 group.cpp:659] Trying to get
>>> '/mesos/info_0000000000' in ZooKeeper
>>> *I0813 10:19:46.602330  2611 detector.cpp:452] A new leading master
>>> (UPID=master@127.0.1.1:5050 <http://master@127.0.1.1:5050>) is detected*
>>> *I0813 10:19:46.602412  2607 master.cpp:1356] The newly elected leader
>>> is master@127.0.1.1:5050 <http://master@127.0.1.1:5050> with id
>>> 20150813-101601-16842879-5050-6368*
>>> I0813 10:19:58.542353  2611 http.cpp:516] HTTP request for
>>> '/master/state.json'
>>> I0813 10:19:59.457691  2612 http.cpp:516] HTTP request for
>>> '/master/state.json'
>>> I0813 10:20:00.355845  2606 http.cpp:516] HTTP request for
>>> '/master/state.json'
>>> I0813 10:20:06.577448  2609 http.cpp:352] HTTP request for
>>> '/master/redirect'
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>

Mime
View raw message