mesos-user mailing list archives

From Geoffroy Jabouley <geoffroy.jabou...@gmail.com>
Subject Task Checkpointing with Mesos, Marathon and Docker containers
Date Tue, 25 Nov 2014 15:43:30 GMT
Hello

I am currently trying to enable checkpointing on my Mesos cluster.

Starting from an application running in a Docker container on the cluster,
launched from Marathon, my use cases are as follows:

*UC1: kill the Marathon service, then restart it after 2 minutes.*
*Expected*: the Mesos task is still active and the Docker container is
running. When the Marathon service restarts, it gets its tasks back.

*Result*: OK


*UC2: kill the Mesos slave, then restart it after 2 minutes.*
*Expected*: the Mesos task remains active and the Docker container is running.
When the Mesos slave service restarts, it gets its tasks back. Marathon
does not show any error.

*Result*: the task gets status LOST when the slave is killed. The Docker
container keeps running. Marathon detects that the application went down and
spawns a new one on another available Mesos slave. When the slave restarts, it
kills the previously running container and starts a new one. So I end up with 2
applications on my cluster: one spawned by Marathon, and another orphaned one.
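
To make the UC2 outcome concrete, the orphan can be described as a task that is running on the cluster but that Marathon does not list. The following is only a minimal sketch of that comparison, not anything provided by Mesos or Marathon: the function name `find_orphans` and the task IDs are hypothetical, and how the two ID lists are collected (for example from Marathon and from the slaves) is left out.

```python
def find_orphans(known_task_ids, running_task_ids):
    """Return the running task IDs that the scheduler does not know about.

    known_task_ids:   IDs of the tasks Marathon currently reports (assumed input).
    running_task_ids: IDs of the tasks actually running on the slaves (assumed input).
    """
    # Anything running that Marathon does not list is treated as an orphan.
    return sorted(set(running_task_ids) - set(known_task_ids))


if __name__ == "__main__":
    # Hypothetical IDs mirroring UC2: Marathon knows only the task it respawned,
    # while the restarted slave has started a second copy of the application.
    known = ["app.respawned-by-marathon"]
    running = ["app.respawned-by-marathon", "app.restarted-by-slave"]
    print(find_orphans(known, running))
```

In the expected UC2 behavior this list would be empty, because the restarted slave would reattach to the original task instead of starting a new one.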


Is this behavior normal? Can you please explain what I am doing wrong?

-----------------------------------------------------------------------------------------------------------

Here is the configuration I have come up with so far:
Mesos 0.19.1 (not dockerized)
Marathon 0.6.1 (not dockerized)
Docker 1.3 + Deimos 0.4.2

The Mesos master is started with:
*/usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050
--log_dir=/var/log/mesos --cluster=CLUSTER_POC --hostname=... --ip=...
--quorum=1 --work_dir=/var/lib/mesos*

The Mesos slave is started with:
*/usr/local/sbin/mesos-slave --master=zk://...:2181/mesos
--log_dir=/var/log/mesos --checkpoint=true
--containerizer_path=/usr/local/bin/deimos
--executor_registration_timeout=5mins --hostname=... --ip=...
--isolation=external --recover=reconnect --recovery_timeout=120mins
--strict=true*

Marathon is started with:
*java -Xmx512m -Djava.library.path=/usr/local/lib
-Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -cp
/usr/local/bin/marathon mesosphere.marathon.Main --zk
zk://...:2181/marathon --master zk://...:2181/mesos --local_port_min 30000
--hostname ... --event_subscriber http_callback --http_port 8080
--task_launch_timeout 300000 --local_port_max 40000 --ha --checkpoint*
