aurora-dev mailing list archives

From Maxim Khutornenko <ma...@apache.org>
Subject Re: Lost jobs on cluster failure
Date Tue, 16 Jun 2015 21:34:48 GMT
I'm not sure I understand the problem. Are you observing a loss of the
Mesos master, of the Aurora leader, or of the native log quorum?

To answer your questions: every part of the Aurora/Mesos system is
designed to be failure-tolerant. The loss of a Mesos master, an Aurora
leader or a Mesos slave should not cause any irrecoverable data loss.
Every effort is made to restart tasks to compensate for any lost
instances. There should be no duplicate jobs, but there could be
duplicate task instances for some time until Aurora and Mesos
reconcile their state (usually within 1 hour).

As for job health monitoring, I'd recommend exporting and alerting on
job stats (similar to the scheduler stats exposed via the /vars endpoint).
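
A minimal sketch of what such alerting could look like, assuming the
stats are exported as plain text with one "name value" pair per line.
The stat names, thresholds and URL below are hypothetical placeholders,
not names taken from Aurora itself:

```python
from urllib.request import urlopen


def parse_vars(text):
    """Parse 'name value' lines into a dict.

    Numeric values become floats; anything else is kept as a string.
    Lines without a space separator are skipped.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(" ")
        if not sep:
            continue
        try:
            stats[name] = float(value)
        except ValueError:
            stats[name] = value
    return stats


def breached(stats, thresholds):
    """Return the sorted names of stats whose value exceeds its threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if isinstance(stats.get(name), float) and stats[name] > limit
    )


def fetch_vars(url):
    """Fetch the raw stats text. The URL is supplied by the caller."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Hypothetical endpoint and stat name; substitute your own.
    text = fetch_vars("http://scheduler.example.com:8081/vars")
    alerts = breached(parse_vars(text), {"my_job_lost_tasks": 0})
    if alerts:
        print("ALERT:", ", ".join(alerts))
```

Run periodically (from cron, for example), this prints an alert line
whenever a watched stat rises above its limit; a real setup would
more likely feed the same stats into an existing monitoring system.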

Thanks,
Maxim

On Tue, Jun 16, 2015 at 2:19 PM, Mauricio Garavaglia
<mauriciogaravaglia@gmail.com> wrote:
> Hello!
>
> We had an issue with our Aurora/Mesos cluster that made it lose quorum,
> and we are wondering how the recovery of lost jobs works. What happened
> is basically:
>
> #1 Start Aurora job, and have it allocated to node A.
> #2 Aurora Schedulers, Mesos Master and ZK stopped
> #3 node A stopped
> #4 Aurora Schedulers, Mesos Master and ZK started again
>
> Should it assume the Mesos list is complete, assume the missing nodes
> are indeed gone, and hence restart the jobs? Is there any guarantee that
> multiple instances of the same job will not be started?
>
> If we had health checks, we could presumably use those to validate that the
> job is, indeed, truly dead. Would that work?
>
> Thanks!
