aurora-dev mailing list archives

From Maxim Khutornenko <ma...@apache.org>
Subject Re: Lost jobs on cluster failure
Date Wed, 17 Jun 2015 16:27:38 GMT
The GcExecutorLauncher uses Mesos offered resources to launch GC
tasks. While it does not use much (1), it will take a bite out of your
resource pool quite regularly, which will reduce your bin-packing
efficiency if it runs often.

Also, as Bill mentioned, the current GC model is slated for removal
very soon. There is a task reconciliation system in place that you may
want to use instead. It does not consume Mesos offers and uses a
combination of Aurora/Mesos reconciliation runs to bring the system
back into a consistent state (see AURORA-1047). It can be enabled by
setting "-reconciliation_initial_delay=0mins" and disabling the GC
executor launcher by removing the "-gc_executor_path" option. You can
dial in the reconciliation frequency by adjusting the
"-reconciliation_explicit_interval" and
"-reconciliation_implicit_interval" options.

(1) - https://github.com/apache/aurora/blob/827b9abea48babe53ad5b2c521757c60f04c6dfc/src/main/java/org/apache/aurora/scheduler/async/GcExecutorLauncher.java#L76
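
As an aside on Bill's note further down about enforcing an at-most-N
instance count at the application layer: below is a minimal sketch of
the at-most-one case using Apache Curator's LeaderLatch recipe. The
connect string, latch path, and class names are illustrative, not
anything Aurora ships:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class SingletonGuard {
      public static void main(String[] args) throws Exception {
        // Connect to a ZK ensemble (address is illustrative).
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk1:2181,zk2:2181,zk3:2181",
            new ExponentialBackoffRetry(1000, 3));
        client.start();

        // All replicas race for the same latch path; at most one
        // holds leadership at a time.
        LeaderLatch latch = new LeaderLatch(client, "/myservice/singleton");
        latch.start();
        latch.await();  // blocks until this instance becomes leader

        // Safe to do the exclusive work here. If this process dies,
        // its ephemeral znode vanishes and another replica takes over.
        doExclusiveWork();

        latch.close();
        client.close();
      }

      private static void doExclusiveWork() { /* application logic */ }
    }

For an at-most-N bound rather than at-most-one, something like
Curator's InterProcessSemaphoreV2 recipe is the analogous tool. Either
way, the guarantee lives in the application, not in Aurora.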

On Wed, Jun 17, 2015 at 7:29 AM, Mauricio Garavaglia
<mauriciogaravaglia@gmail.com> wrote:
> Thanks so much for the answers guys, they are really helpful.
>
> On Wed, Jun 17, 2015 at 1:57 AM, Bill Farner <wfarner@apache.org> wrote:
>
>> Maxim's reply is correct; elaborating below:
>>
>> > Should it assume the Mesos list is complete, and assume the missing
>> > nodes are indeed gone, and hence restart the jobs?
>>
>>
>> Yes.  This scenario is currently reconciled by the GC executor, which runs
>> on an hourly interval by default.  This behavior is soon to be replaced by
>> a newer process that should be able to provide greater responsiveness in
>> this situation.
>>
>
> How expensive is the GC operation? Is it safe to execute it more
> frequently (like every 10 minutes)?
>
>
>
>> > Is there any guarantee that multiple instances of the same job will
>> > not be started?
>>
>>
>> Nope!  Aurora is designed to converge towards the desired number of
>> instances of a job, but errs on the side of over-provisioning.  This tends
>> to be the desired behavior in more cases than not.  Applications requiring
>> an at-most-N instance count must enforce that in the application layer,
>> likely leaning on something like ZooKeeper or etcd.
>>
>> > If we had health checks, we could presumably use those to validate
>> > that the job is, indeed, truly dead. Would that work?
>>
>>
>> Health checks would not change behavior in this scenario, as they are
>> only used for node-local liveness monitoring.
>>
>> -=Bill
>>
>> On Tue, Jun 16, 2015 at 2:34 PM, Maxim Khutornenko <maxim@apache.org>
>> wrote:
>>
>> > Not sure I am getting the problem here. Are you observing Mesos
>> > master, Aurora leader or a native log quorum loss?
>> >
>> > To your questions, every part of the Aurora/Mesos system is designed
>> > in a failure-tolerant manner. A loss of Mesos master, Aurora leader or
>> > a Mesos slave should not cause any irrecoverable data loss. All
>> > efforts are made to ensure tasks are restarted to compensate for any
>> > lost instances. There should be no duplicate jobs but there could be
>> > duplicate task instances for some time until Aurora/Mesos reconcile
>> > their state (usually within 1 hour).
>> >
>> > As for job health monitoring, I'd recommend exporting and alerting on
>> > job stats (similar to scheduler stats exposed via /vars endpoint).
>> >
>> > Thanks,
>> > Maxim
>> >
>> > On Tue, Jun 16, 2015 at 2:19 PM, Mauricio Garavaglia
>> > <mauriciogaravaglia@gmail.com> wrote:
>> > > Hello!
>> > >
>> > > We had an issue with our Aurora/Mesos cluster that made it lose
>> > > quorum, and we are wondering how the recovery of lost jobs works.
>> > > What happened is basically:
>> > >
>> > > #1 Start Aurora job, and have it allocated to node A.
>> > > #2 Aurora Schedulers, Mesos Master and ZK stopped
>> > > #3 node A stopped
>> > > #4 Aurora Schedulers, Mesos Master and ZK started again
>> > >
>> > > Should it assume the Mesos list is complete, assume the missing
>> > > nodes are indeed gone, and hence restart the jobs? Is there any
>> > > guarantee that multiple instances of the same job will not be
>> > > started?
>> > >
>> > > If we had health checks, we could presumably use those to validate
>> > > that the job is, indeed, truly dead. Would that work?
>> > >
>> > > Thanks!
>> >
>>
