mesos-user mailing list archives

From Jeff Schroeder <jeffschroe...@computer.org>
Subject Re: cluster confusion after zookeeper blip
Date Mon, 18 May 2015 21:19:45 GMT
Not that this is super helpful for your issue, but I ran into an identical
problem this morning with Aurora on top of Mesos, where the scheduler was
inoperable because my ZK ensemble had lost quorum and was generally
misbehaving. However, as soon as I restored quorum, everything immediately
recovered. I believe it had to do with the replicated log that Aurora uses.
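
In case it helps with diagnosis: here's a minimal sketch (Python, with
placeholder hostnames) that asks each node in the ensemble for its role
using ZooKeeper's standard four-letter "srvr" command, so you can confirm
quorum actually came back:

    import socket

    ZK_NODES = ["zk1", "zk2", "zk3"]  # placeholder hostnames

    def zk_mode(host, port=2181, timeout=5):
        # "srvr" is one of ZooKeeper's built-in four-letter admin
        # commands; its output includes a "Mode:" line that reads
        # leader, follower, or standalone once the node is serving.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"srvr")
            chunks = []
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
        for line in b"".join(chunks).decode().splitlines():
            if line.startswith("Mode:"):
                return line
        return "Mode: unknown (node up but not serving -- no quorum?)"

    for node in ZK_NODES:
        try:
            print(node, "->", zk_mode(node))
        except OSError as exc:
            print(node, "-> unreachable:", exc)

A healthy 3-node ensemble should report exactly one leader and two
followers; anything else and the schedulers sitting on top will stay
wedged.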

On Monday, May 18, 2015, Dick Davies <dick@hellooperator.net> wrote:

> We run a 3-node Marathon cluster on top of 3 Mesos masters + 6 slaves.
> (Mesos 0.21.0, Marathon 0.7.5)
>
> This morning we had a network outage long enough for everything to
> lose ZooKeeper. Now our Marathon UI is empty (all 3 Marathons think
> someone else is the master, and Marathon's 'proxy to leader' feature
> means the REST API is toast).
>
> The odd thing is, at the Mesos level, the master UI shows no tasks
> running (the logs mention orphaned tasks), but if I click into the
> 'Slaves' tab and dig down, the slave view details tasks that are in
> fact active.
>
> Any way to bring order to this without needing to kill those tasks?
> We have no actual outage from a user's point of view, but the cluster
> itself is pretty confused, and our service discovery relies on the
> Marathon API, which is timing out.
>
> Although Mesos has checkpointing enabled, Marathon isn't running with
> checkpointing on (it's the default now, but apparently that doesn't
> apply to existing frameworks, and we started this cluster around
> Marathon 0.4.x).
>
> Would enabling checkpointing help with this kind of issue? If so, how
> do I enable it for an existing framework?
>
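
On the master/slave discrepancy: since the master forgot the frameworks
but the slaves kept their tasks, it can help to diff what each side
reports before deciding anything needs killing. A rough sketch against
the /state.json endpoints Mesos serves on the master (:5050) and each
slave (:5051); hostnames are placeholders, and the JSON field names are
what I remember from 0.21-era output, so verify against your own:

    import json
    from urllib.request import urlopen

    def fetch_state(url):
        # Both the master and the slaves expose their view of the
        # cluster as JSON on /state.json in this Mesos version.
        with urlopen(url, timeout=10) as resp:
            return json.load(resp)

    master = fetch_state("http://mesos-master:5050/state.json")
    registered = sum(len(fw.get("tasks", []))
                     for fw in master.get("frameworks", []))
    print("master sees", registered, "tasks on registered frameworks")
    print("master lists", len(master.get("orphan_tasks", [])), "orphan tasks")

    # Each slave reports tasks nested under framework -> executor.
    slave = fetch_state("http://mesos-slave1:5051/state.json")
    for fw in slave.get("frameworks", []):
        for ex in fw.get("executors", []):
            for task in ex.get("tasks", []):
                print("slave still running:", task.get("name"), task.get("state"))

If the slave-side tasks really are healthy, that matches what I saw
this morning: get ZK and the masters stable first and let the
frameworks re-register, rather than touching the tasks.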

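On the checkpointing question: as I understand it, checkpoint is a field
in the FrameworkInfo Marathon sends when it first registers, and the
master keeps whatever it saw at initial registration, so flipping the
flag on a long-lived framework ID may not take effect on its own. You
can at least confirm what the master currently has recorded; another
hedged sketch (the "checkpoint" field name is what I recall from this
era's /state.json, and matching on the framework name "marathon" is an
assumption):

    import json
    from urllib.request import urlopen

    with urlopen("http://mesos-master:5050/state.json", timeout=10) as resp:
        state = json.load(resp)

    for fw in state.get("frameworks", []):
        # Framework name "marathon" is an assumption; match on
        # whatever your Marathon registered itself as.
        if fw.get("name") == "marathon":
            print(fw.get("id"), "checkpoint =", fw.get("checkpoint"))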

-- 
Text by Jeff, typos by iPhone
