mesos-user mailing list archives

From Matthias Bach
Subject Tasks failing when restarting slave on Mesos 0.23.1
Date Thu, 14 Jan 2016 16:10:51 GMT
Hi all,

We are using Mesos 0.23.1 in combination with Aurora 0.10.0. So far we
have been using the JSON format for Mesos' credential files. However,
because of MESOS-3695 we decided to switch to the plain text format
before updating to 0.24.1. Our understanding is that this should be a
NOOP. However, on our cluster this caused multiple tasks to fail on each

I have attached two excerpts from the Mesos slave log: one where I
grepped for the executor ID of one of the failed tasks, and one where I
grepped for the ID of the corresponding container. What you can see is
that recovery of the container is started and, immediately afterwards,
the executor is killed.

Our change procedure was:
* Place the new plain-text credential file
* Restart the slave with `--credential` pointing to the new file
* Remove the old JSON credential file
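
To illustrate the first step, the conversion between the two credential
formats can be sketched as below. This is only a minimal sketch, assuming
the JSON file holds a single object with `principal` and `secret` keys and
that the plain-text format is one line with both values separated by
whitespace. The function name and the example values are hypothetical.

```python
import json


def json_credential_to_plain_text(json_text: str) -> str:
    """Turn a JSON credential into the single-line plain-text form.

    Assumes the JSON is one object with "principal" and "secret" keys.
    """
    credential = json.loads(json_text)
    principal = credential["principal"]
    secret = credential["secret"]
    # The plain-text form is whitespace-separated, so neither value may
    # be empty or contain whitespace itself.
    for value in (principal, secret):
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(
                "principal and secret must be non-empty and free of whitespace"
            )
    return f"{principal} {secret}"


if __name__ == "__main__":
    # Hypothetical example values, not taken from a real cluster.
    example = '{"principal": "example-principal", "secret": "example-secret"}'
    print(json_credential_to_plain_text(example))
```

The sketch rejects values containing whitespace because such values could
not be told apart again in a whitespace-separated file.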

We are running the Mesos slave using supervisord and use the following
isolators: cgroups/cpu, cgroups/mem, filesystem/shared, namespaces/pid,
and posix/disk. In addition we use `--enforce_container_disk_quota`.
Regarding recovery we use the options `--recover="reconnect"` and

The Thermos log does not provide any hints as to what happened. It looks
like Thermos was SIGKILLed.

Have any of you run into this problem before? Do you have an idea what
could cause this behaviour? Do you have any suggestions as to what
information we could look for to better understand what happens?

Kind Regards,

Dr. Matthias Bach
Senior Software Engineer
*Blue Yonder GmbH*
Ohiostraße 8
D-76149 Karlsruhe

Tel +49 (0)721 383 117 6244
Fax +49 (0)721 383 117 69
Registergericht Mannheim, HRB 704547
USt-IdNr. DE DE 277 091 535
Geschäftsführer: Jochen Bossert, Uwe Weiss (CEO)
