mesos-user mailing list archives

From Maciej Strzelecki <maciej.strzele...@crealytics.com>
Subject Marathon can no longer deploy any apps after a failover
Date Thu, 16 Jul 2015 12:29:56 GMT
Problem:


If I restart the current Marathon framework leader (the host shown in the Active Frameworks
tab of the Mesos UI), a new leader is elected after a moment, but any new deployment then stays
stuck indefinitely in the 'deploying' state (empty black bar, 0/1, hanging). Even at debug log
level I don't see any errors in the Marathon or Mesos logs.

The existing tasks are also untouchable during that time: they keep running, but I can't kill,
restart, or scale them.


When that happens, I can do the following:

1. Stop Marathon on all masters.
2. Remove the framework via a curl call to the Mesos API /shutdown endpoint.
3. Purge /marathon using the ZooKeeper CLI.
4. Restart the Docker service on all slaves (that kills the zombie containers).
5. Restart the mesos-slave service on all slaves (pampering my paranoia here).

After that I can deploy apps again.
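
For reference, the recovery steps above can be sketched as a dry-run script that only prints the commands in order and executes nothing. This is a rough sketch under assumptions: the host names, ZooKeeper address, and framework ID are placeholders, the service names assume the stock Ubuntu 14.04 packages, and the `/master/shutdown` path with a `frameworkId` form field is how I understand the Mesos 0.22 endpoint to be called.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the recovery commands, does not execute them.
# Every value below is a placeholder / assumption, not taken from the real cluster.
MASTERS="mesos-master1 mesos-master2 mesos-master3"  # hypothetical master host names
SLAVES="mesos-slave1 mesos-slave2"                   # hypothetical slave host names
MESOS_MASTER="mesos-master1:5050"                    # assumed address of the leading Mesos master
ZK="mesos-master1:2181"                              # assumed ZooKeeper address
FRAMEWORK_ID="REPLACE_WITH_FRAMEWORK_ID"             # ID shown in the Mesos UI frameworks tab

recovery_plan() {
  local h
  # 1. Stop Marathon on all masters.
  for h in $MASTERS; do
    echo "ssh $h service marathon stop"
  done
  # 2. Remove the framework through the Mesos master shutdown endpoint (assumed form).
  echo "curl -X POST -d frameworkId=$FRAMEWORK_ID http://$MESOS_MASTER/master/shutdown"
  # 3. Purge Marathon's state from ZooKeeper.
  echo "zkCli.sh -server $ZK rmr /marathon"
  # 4 and 5. Restart Docker, then mesos-slave, on every slave.
  # The Docker service name may differ depending on the installed package.
  for h in $SLAVES; do
    echo "ssh $h service docker restart"
    echo "ssh $h service mesos-slave restart"
  done
}

recovery_plan
```

Printing the plan first makes it possible to review the order before running anything, since every step after the first one is disruptive.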


How can I avoid this problem? Am I missing any basic settings? This is scary: rebooting a
single master (out of 3 or 5 servers) freezes everything deployed through Marathon, and the
steps to reclaim control introduce downtime for every single app running there.





Configuration:


Running Ubuntu 14.04.2 LTS

mesos                               0.22.1-1.0.ubuntu1404

marathon                            0.9.0-1.0.381.ubuntu1404

chronos                             2.3.4-1.0.81.ubuntu1404


The cluster uses 3 masters and 15 slaves. The master machines also run the mesos-slave
process (although those machines offer only a portion of their resources).


The configuration for Mesos/Marathon relies heavily on defaults; the options I have specified
are shown below. The quorum is 2.


The Marathon service runs on the 3 master machines.


root@mesos-master1 ~ # tree /etc/marathon/
/etc/marathon/
`-- conf
    |-- event_subscriber
    |-- framework_name
    |-- hostname
    |-- logging_level
    `-- zk

1 directory, 5 files
root@mesos-master1 ~ # tree /etc/mesos
/etc/mesos
`-- zk

0 directories, 1 file
root@mesos-master1 ~ # tree /etc/mesos-slave/
/etc/mesos-slave/
|-- containerizers
|-- docker_stop_timeout
|-- executor_registration_timeout
|-- executor_shutdown_grace_period
|-- hostname
|-- ip
|-- logging_level
`-- resources

0 directories, 8 files
root@mesos-master1 ~ # tree /etc/mesos-master
/etc/mesos-master
|-- cluster
|-- hostname
|-- ip
|-- logging_level
|-- quorum
`-- work_dir
