mesos-user mailing list archives

From Rogier Dikkes <rogier.dik...@surfsara.nl>
Subject Marathon split brain situation
Date Fri, 28 Aug 2015 14:05:32 GMT
Hello all,

I am running a test cluster with Mesos and Marathon on 20 compute nodes 
and 2 head nodes running VMs that host all masters, frameworks, etc. 
Until the 0.23 update there were not many issues, but today I saw an 
issue that I must share, hoping you know more about it.

We run an updated Mesos version 0.23 and Marathon 0.10.0.

I started an HDFS namenode in Docker through Marathon and a couple of 
datanodes on the agents. I am slowly building this config further with 
secondary namenodes, datanodes and journal nodes, all in containers. For 
now it is a very basic setup to see how stable everything is and what we 
should consider when running in containers.

Today we found out that the Marathon leader was suddenly registered 
twice as a framework, with different IDs, and to make it worse, it 
spawned a task again that was already running. Suddenly we had 2 
namenodes with the name management. Our Consul cluster auto-registered 
both containers and started to forward all traffic to these 2 namenodes.

I always thought that ZooKeeper took care of leader election for 
Marathon and that this should prevent scenarios like this. However, both 
frameworks had a different ID, which would explain why ZooKeeper didn't 
handle the election.
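
A minimal sketch of how the double registration described above could be 
confirmed from the Mesos master state. It assumes the state JSON has 
already been fetched from the master (for this Mesos version that would 
presumably be the /master/state.json endpoint) and that it contains a 
"frameworks" list whose entries carry "id", "name" and "active" fields. 
The function name and the sample data are hypothetical and hand-written 
for illustration, not taken from the cluster in question.

```python
from collections import defaultdict


def duplicate_frameworks(state):
    """Return {framework name: sorted list of IDs} for names registered more than once."""
    ids_by_name = defaultdict(set)
    for fw in state.get("frameworks", []):
        # Only count frameworks the master currently reports as active
        # (assumed field; entries without it are treated as active).
        if fw.get("active", True):
            ids_by_name[fw["name"]].add(fw["id"])
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical sample in the assumed shape of the master state.
sample_state = {
    "frameworks": [
        {"id": "20150828-000001-0000", "name": "marathon", "active": True},
        {"id": "20150828-000002-0000", "name": "marathon", "active": True},
        {"id": "20150828-000003-0000", "name": "chronos", "active": True},
    ]
}

print(duplicate_frameworks(sample_state))
```

Any name that comes back with more than one ID is a candidate for the 
situation above, where one Marathon process appears as two frameworks.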

The Marathon web interface was no longer responding and everything timed 
out. I found out that only a single Marathon process was running. To get 
HDFS running again I killed the containers and killed the Marathon 
process. From the logs I couldn't gather why this happened: in the 10 
minutes around the registration of the framework there is nothing but 
offers, HTTP calls and task syncs.

The strange thing I just noticed is that Marathon occasionally 
re-registers itself even though its process has not been restarted or 
elected.

Does anyone have an idea where to look?

-- 
Rogier Dikkes
Systeem Programmeur Hadoop & HPC Cloud
SURFsara | Science Park 140 | 1098 XG Amsterdam

