aurora-dev mailing list archives

From Bill Farner <wfar...@apache.org>
Subject Re: Lost jobs on cluster failure
Date Wed, 17 Jun 2015 04:57:16 GMT
Maxim's reply is correct; elaborating:

> Should it assume the Mesos list is complete, and assume the missing nodes
> are indeed gone, and hence restart the jobs?


Yes.  This scenario is currently reconciled by the GC executor, which runs
on an hourly interval by default.  This behavior is soon to be replaced by
a newer process that should be able to provide greater responsiveness in
this situation.
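
As a rough illustration of the reconciliation described above (a sketch
only, not Aurora's actual code): compare the tasks the scheduler believes
are running with the tasks Mesos reports, then reschedule the ones that
are missing. The function name, the "RUNNING" state string, and the data
shapes below are all hypothetical.

```python
def reconcile(scheduler_tasks, mesos_tasks):
    """Compare the scheduler's view of tasks with what Mesos reports.

    scheduler_tasks: dict mapping task id -> state the scheduler believes.
    mesos_tasks: set of task ids Mesos reports as running.
    Returns (lost, unknown) as sorted lists of task ids.
    """
    # Tasks the scheduler still considers active (hypothetical state name).
    active = {tid for tid, state in scheduler_tasks.items() if state == "RUNNING"}

    # Believed running, but Mesos does not know them: treat as lost
    # so that replacements get scheduled.
    lost = active - mesos_tasks

    # Running in Mesos, but not wanted by the scheduler: candidates to kill.
    unknown = mesos_tasks - active

    return sorted(lost), sorted(unknown)
```

In the scenario from this thread, the task on node A would land in the
"lost" list once the scheduler is back and Mesos no longer reports it.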

> Is there any guarantee that multiple instances of the same job will not be
> started?


Nope!  Aurora is designed to converge towards the desired number of
instances of a job, but errs on the side of over-provisioning.  This tends
to be the desired behavior in more cases than not.  Applications requiring
a strict upper bound on instance count must enforce it in the application
layer, likely leaning on something like ZooKeeper or etcd.
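
One common shape for this is a lease: an instance only does work while it
holds a named, expiring lock. The sketch below uses an in-memory stand-in
for the coordination service so it runs on its own; the class and method
names are hypothetical, and a real deployment would use ZooKeeper or etcd
primitives instead.

```python
class InMemoryLeaseStore:
    """Stand-in for a coordination service (hypothetical, for illustration)."""

    def __init__(self):
        self._leases = {}  # key -> (owner, expires_at)

    def try_acquire(self, key, owner, ttl, now):
        """Grant or renew the lease on `key` for `owner`; return True on success.

        The lease is granted if nobody holds it, if the previous lease has
        expired, or if `owner` already holds it (renewal).
        """
        holder = self._leases.get(key)
        if holder is None or holder[1] <= now or holder[0] == owner:
            self._leases[key] = (owner, now + ttl)
            return True
        return False


store = InMemoryLeaseStore()
first = store.try_acquire("job/instance-0", "task-A", ttl=10, now=0)
second = store.try_acquire("job/instance-0", "task-B", ttl=10, now=5)
third = store.try_acquire("job/instance-0", "task-B", ttl=10, now=10)
```

Here task-B is refused while task-A's lease is live and takes over only
after it expires. A duplicate instance that fails to acquire the lease
should idle or exit rather than do work. The clock is passed in
explicitly to keep the example deterministic; real leases also have to
account for clock skew and for a holder that pauses past its expiry.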

> If we had health checks, we could presumably use those to validate that the
> job is, indeed, truly dead. Would that work?


Health checks would not change behavior in this scenario, as they are only
used for node-local liveness monitoring.

-=Bill

On Tue, Jun 16, 2015 at 2:34 PM, Maxim Khutornenko <maxim@apache.org> wrote:

> Not sure I am getting the problem here. Are you observing Mesos
> master, Aurora leader or a native log quorum loss?
>
> To your questions, every part of the Aurora/Mesos system is designed
> in a failure-tolerant manner. A loss of Mesos master, Aurora leader or
> a Mesos slave should not cause any irrecoverable data loss. All
> efforts are made to ensure tasks are restarted to compensate for any
> lost instances. There should be no duplicate jobs but there could be
> duplicate task instances for some time until Aurora/Mesos reconcile
> their state (usually within 1 hour).
>
> As for job health monitoring, I'd recommend exporting and alerting on
> job stats (similar to scheduler stats exposed via /vars endpoint).
>
> Thanks,
> Maxim
>
> On Tue, Jun 16, 2015 at 2:19 PM, Mauricio Garavaglia
> <mauriciogaravaglia@gmail.com> wrote:
> > Hello!
> >
> > We had an issue with our Aurora/Mesos cluster that made it lose quorum,
> > and we are wondering how the recovery of lost jobs works. What happened
> > is basically:
> >
> > #1 Start Aurora job, and have it allocated to node A.
> > #2 Aurora Schedulers, Mesos Master and ZK stopped
> > #3 node A stopped
> > #4 Aurora Schedulers, Mesos Master and ZK started again
> >
> > Should it assume the Mesos list is complete, and assume the missing nodes
> > are indeed gone, and hence restart the jobs? Is there any guarantee that
> > multiple instances of the same job will not be started?
> >
> > If we had health checks, we could presumably use those to validate that
> > the job is, indeed, truly dead. Would that work?
> >
> > Thanks!
>
