aurora-dev mailing list archives

From David McLaughlin <da...@dmclaughlin.com>
Subject Re: Aurora reconciliation and Master fail over
Date Mon, 17 Jul 2017 22:00:26 GMT
Based on the thread on the Mesos dev list, it looks like they don't
persist task information, so they don't have the task IDs to send when they
detect that the agent is lost during failover. So unless this is changed on
the Mesos side, we need to act on the slaveLost message and mark all of
those tasks as LOST in Aurora.
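
For illustration, here's a minimal sketch of what acting on slaveLost could
look like on the Aurora side. TaskStore and StateManager are hypothetical
stand-ins for the scheduler's own task storage and state machine, not
Aurora's actual classes; only the Mesos Java API types and the
Scheduler#slaveLost callback are real:

import java.util.List;

import org.apache.mesos.Protos;

// Sketch only: the interfaces below stand in for whatever the scheduler
// uses to track its own view of running tasks.
class SlaveLostHandler {
  interface TaskStore { List<String> fetchTaskIdsOnAgent(String agentId); }
  interface StateManager {
    void changeState(String taskId, Protos.TaskState state, String reason);
  }

  private final TaskStore taskStore;
  private final StateManager stateManager;

  SlaveLostHandler(TaskStore taskStore, StateManager stateManager) {
    this.taskStore = taskStore;
    this.stateManager = stateManager;
  }

  // Invoked from Scheduler#slaveLost(SchedulerDriver, Protos.SlaveID).
  void onSlaveLost(Protos.SlaveID slaveId) {
    // Mesos won't send per-task TASK_LOST in the failover case, so we
    // derive the affected task IDs from our own persisted task view.
    for (String taskId : taskStore.fetchTaskIdsOnAgent(slaveId.getValue())) {
      stateManager.changeState(taskId, Protos.TaskState.TASK_LOST,
          "Agent " + slaveId.getValue() + " lost during master failover");
    }
  }
}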

Or rely on reconciliation. If you want to reconcile more often, keep the following in mind:

1) Implicit reconciliation sends one message to Mesos, and Mesos immediately
replies with N status updates, where N is the number of running tasks. This
process is usually quick (on the order of seconds) because the updates are
mostly no-ops. When you have a large number of running tasks (say 100k+),
you may see some GC pressure due to the flood of status updates, and if the
operation overlaps with another particularly expensive operation (like a
snapshot) it can cause a long stop-the-world GC pause. But it does not
otherwise interfere with any operation.
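
For reference, triggering implicit reconciliation is a single driver call
with an empty status list. A minimal sketch against the Mesos Java
SchedulerDriver API (the wrapper class is mine):

import java.util.Collections;

import org.apache.mesos.Protos;
import org.apache.mesos.SchedulerDriver;

class ImplicitReconciler {
  // An empty status list asks the master to send a status update for
  // every task it knows about for this framework; the scheduler then
  // absorbs the flood of (mostly no-op) updates in statusUpdate().
  static void reconcileImplicitly(SchedulerDriver driver) {
    driver.reconcileTasks(Collections.<Protos.TaskStatus>emptyList());
  }
}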

2) Explicit reconciliation is done in batches: Aurora batches up all
running tasks and sends one batch at a time, staggered by some delay. The
benefit is less GC pressure, but the drawback is that with a lot of running
tasks (again, 100k+) it will take over 10 minutes to complete. So you have
to make sure your reconciliation interval is aligned with this (you can
always increase the batch size to make it finish faster).
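
A rough sketch of the batching idea (illustrative only: batchSize and
delaySeconds are made-up knobs here, and the real implementation is the
TaskReconciler class linked later in this thread):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.apache.mesos.Protos;
import org.apache.mesos.SchedulerDriver;

class ExplicitReconciler {
  // Sends the running tasks' statuses to the master in fixed-size
  // batches, sleeping between batches to spread the resulting
  // status-update traffic over time.
  static void reconcileExplicitly(SchedulerDriver driver,
                                  List<Protos.TaskStatus> running,
                                  int batchSize,
                                  long delaySeconds) throws InterruptedException {
    for (int i = 0; i < running.size(); i += batchSize) {
      List<Protos.TaskStatus> batch = new ArrayList<>(
          running.subList(i, Math.min(i + batchSize, running.size())));
      driver.reconcileTasks(batch);
      TimeUnit.SECONDS.sleep(delaySeconds);
    }
  }
}

For example, 100k tasks in batches of 1,000 with a 5 second stagger already
works out to more than 8 minutes of sending, which is roughly where the 10+
minute figure above comes from.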

Cheers,
David

On Sun, Jul 16, 2017 at 11:10 AM, Meghdoot bhattacharya <
meghdoot_b@yahoo.com.invalid> wrote:

> Got it. Thx!
>
> > On Jul 16, 2017, at 9:49 AM, Stephan Erb <mail@stephanerb.eu> wrote:
> >
> > Reconciliation in Aurora is not a specific mode. It just runs
> > concurrently with other background work such as snapshots or backups [1].
> >
> >
> > Just be aware that we don't have metrics to track the runtime of
> > explicit and implicit reconciliations. If you use settings that are
> > overly aggressive, you might overload Aurora's queue of incoming Mesos
> > status updates (for example).
> >
> > [1] https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/main/java/org/apache/aurora/scheduler/reconciliation/TaskReconciler.java
> >
> >
> >> On Sat, 2017-07-15 at 22:28 -0700, Meghdoot bhattacharya wrote:
> >> Thx David for the follow up and confirmation.
> >> We have started the thread on the mesos dev DL.
> >>
> >> So to get clarification on the recon: what, in general, is the effect
> >> during the recon? Are scheduling and activities like snapshots paused
> >> while the recon takes place? Trying to see whether to run aggressive
> >> recon in the meantime.
> >>
> >> Thx
> >>
> >>> On Jul 15, 2017, at 9:33 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
> >>>
> >>> I've left a comment on the initial RB detailing how the change broke
> >>> backwards-compatibility. Given that the tasks are marked as lost as
> >>> soon as the agent reregisters after slaveLost is sent anyway, there
> >>> doesn't seem to be any reason not to send TASK_LOST too. I think this
> >>> should be an easy fix.
> >>>
> >>> On Sat, Jul 15, 2017 at 9:21 AM, David McLaughlin <dmclaughlin@apache.org>
> >>> wrote:
> >>>
> >>>> Yes, we've confirmed this internally too (Santhosh did the work here):
> >>>>
> >>>>> When an agent becomes unreachable while the master is running, it
> >>>>> sends TASK_LOST events for each task on the agent.
> >>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L7066-L7107
> >>>>> Marking an agent unreachable after failover does not cause
> >>>>> TASK_LOST events.
> >>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/master/master.cpp#L2036-L2070
> >>>>> Once an agent re-registers, it sends TASK_LOST events for tasks
> >>>>> that it does not know about after a master failover.
> >>>>> https://github.com/apache/mesos/blob/33093c893773f8c9d293afe38e9909f9a2868d32/src/slave/slave.cpp#L1324-L1383
> >>>>
> >>>>
> >>>>
> >>>> The separate code path for markUnreachableAfterFailover appears to
> >>>> have been added by this commit:
> >>>> https://github.com/apache/mesos/commit/937c85f2f6528d1ac56ea9a7aa174ca0bd371d0c
> >>>>
> >>>> And I think this totally breaks the promise of introducing the
> >>>> PARTITION_AWARE stuff in a backwards-compatible way.
> >>>>
> >>>> So right now, yes, we rely on reconciliation to finally mark the
> >>>> tasks as LOST and reschedule their replacements.
> >>>>
> >>>> I think the only reason we haven't been more impacted by this at
> >>>> Twitter is that our Mesos master is remarkably stable (compared to
> >>>> Aurora's daily failovers).
> >>>>
> >>>> We have two paths forward here: push forward and embrace the new
> >>>> partition-awareness features in Aurora, and/or push back on the above
> >>>> change with the Mesos community so we have a better story for
> >>>> non-partition-aware APIs in the short term.
> >>>>
> >>>>
> >>>>
> >>>> On Sat, Jul 15, 2017 at 2:01 AM, Meghdoot bhattacharya <
> >>>> meghdoot_b@yahoo.com.invalid> wrote:
> >>>>
> >>>>> We can reproduce it easily; the steps are:
> >>>>> 1. Shut down the leading Mesos master.
> >>>>> 2. Shut down an agent at the same time.
> >>>>> 3. Wait for 10 mins.
> >>>>>
> >>>>> What Renan and I saw in the logs was that only agent lost was sent,
> >>>>> not task lost, while in the regular health-check-expiry scenario
> >>>>> both task lost and agent lost were sent.
> >>>>>
> >>>>> So yes, this is very concerning.
> >>>>>
> >>>>> Thx
> >>>>>
> >>>>>> On Jul 14, 2017, at 10:28 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
> >>>>>>
> >>>>>> It would be interesting to see the logs. I think that will tell
> >>>>>> you if the Mesos master is:
> >>>>>>
> >>>>>> a) Sending slaveLost
> >>>>>> b) Trying to send TASK_LOST
> >>>>>>
> >>>>>> And then the Scheduler logs (and/or the metrics it exports) should
> >>>>>> tell you whether those events were received. If this is
> >>>>>> reproducible, I'd consider it a serious bug.
> >>>>>>
> >>>>>> On Fri, Jul 14, 2017 at 10:04 AM, Meghdoot bhattacharya <
> >>>>>> meghdoot_b@yahoo.com.invalid> wrote:
> >>>>>>
> >>>>>>> So in this situation, why is Aurora not replacing the tasks
> >>>>>>> instead of waiting for external recon to fix it?
> >>>>>>>
> >>>>>>> This is different from when the 75 sec (5*15) health check of the
> >>>>>>> slave times out (no master failover): there, Aurora replaces the
> >>>>>>> task on the task lost message.
> >>>>>>>
> >>>>>>> Are you hinting we should ask the Mesos folks why, in the master
> >>>>>>> failover reregistration-timeout scenario, task lost is not sent
> >>>>>>> even though slave lost is sent, when per the docs below task lost
> >>>>>>> should have been sent?
> >>>>>>>
> >>>>>>> Because either Mesos is not sending the right status or Aurora is
> >>>>>>> not handling it.
> >>>>>>>
> >>>>>>> Thx
> >>>>>>>
> >>>>>>>> On Jul 14, 2017, at 8:21 AM, David McLaughlin <dmclaughlin@apache.org> wrote:
> >>>>>>>>
> >>>>>>>> "1. When mesos sends slave lost after 10 mins in this
> >>>>>>>> situation , why
> >>>>>>>
> >>>>>>> does
> >>>>>>>> aurora not act on it?"
> >>>>>>>>
> >>>>>>>> Because Mesos also sends TASK_LOST for every task running
> >>>>>>>> on the agent
> >>>>>>>> whenever it calls slaveLost:
> >>>>>>>>
> >>>>>>>> When it is time to remove an agent, the master removes
> >>>>>>>> the agent from
> >>>>>
> >>>>> the
> >>>>>>>> list of registered agents in the master’s durable
state
> >>>>>>>> <http://mesos.apache.org/documentation/latest/replicated-
> >>>>>
> >>>>> log-internals/>
> >>>>>>> (this
> >>>>>>>> will survive master failover). The master sends a
> >>>>>>>> slaveLost callback
> >>>>>
> >>>>> to
> >>>>>>>> every registered scheduler driver; it also sends
> >>>>>>>> TASK_LOST status
> >>>>>
> >>>>> updates
> >>>>>>>> for every task that was running on the removed agent.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Jul 13, 2017 at 4:32 PM, meghdoot bhattacharya <
> >>>>>>>> meghdoot_b@yahoo.com.invalid> wrote:
> >>>>>>>>
> >>>>>>>>> We were investigating slave re-registration behavior on master
> >>>>>>>>> failover in Aurora 0.17 with Mesos 1.1.
> >>>>>>>>>
> >>>>>>>>> A few important points:
> >>>>>>>>>
> >>>>>>>>> http://mesos.apache.org/documentation/latest/high-availability-framework-guide/
> >>>>>>>>> (If an agent does not reregister with the new master within a
> >>>>>>>>> timeout (controlled by the --agent_reregister_timeout
> >>>>>>>>> configuration flag), the master marks the agent as failed and
> >>>>>>>>> follows the same steps described above. However, there is one
> >>>>>>>>> difference: by default, agents are allowed to reconnect following
> >>>>>>>>> master failover, even after the agent_reregister_timeout has
> >>>>>>>>> fired. This means that frameworks might see a TASK_LOST update
> >>>>>>>>> for a task but then later discover that the task is running
> >>>>>>>>> (because the agent where it was running was allowed to
> >>>>>>>>> reconnect).)
> >>>>>>>>>
> >>>>>>>>> http://mesos.apache.org/documentation/latest/reconciliation/
> >>>>>>>>> (Implicit reconciliation (passing an empty list) should also be
> >>>>>>>>> used periodically, as a defense against data loss in the
> >>>>>>>>> framework. Unless a strict registry is in use on the master, it's
> >>>>>>>>> possible for tasks to resurrect from a LOST state (without a
> >>>>>>>>> strict registry the master does not enforce agent removal across
> >>>>>>>>> failovers). When an unknown task is encountered, the scheduler
> >>>>>>>>> should kill or recover the task.)
> >>>>>>>>>
> >>>>>>>>> https://issues.apache.org/jira/browse/MESOS-5951 (Removes the
> >>>>>>>>> strict registry mode flag from 1.1 and reverts to the old
> >>>>>>>>> behavior of non-strict registry mode, where tasks and executors
> >>>>>>>>> were not killed on agent reregistration timeout on master
> >>>>>>>>> failover.)
> >>>>>>>>>
> >>>>>>>>> So, what we find, if the slave does not come back after 10 mins:
> >>>>>>>>> 1. Mesos master sends slave lost but not task lost to Aurora.
> >>>>>>>>> 2. Aurora does not replace the tasks.
> >>>>>>>>> 3. Only when explicit recon starts does this get corrected, with
> >>>>>>>>> Aurora spawning replacement tasks.
> >>>>>>>>>
> >>>>>>>>> If the slave restarts after 10 mins:
> >>>>>>>>> 1. When implicit recon starts, this situation gets fixed, because
> >>>>>>>>> in Aurora the tasks are marked as lost while Mesos sends running,
> >>>>>>>>> so those get killed and replaced.
> >>>>>>>>>
> >>>>>>>>> So, questions:
> >>>>>>>>> 1. When Mesos sends slave lost after 10 mins in this situation,
> >>>>>>>>> why does Aurora not act on it?
> >>>>>>>>> 2. As per the recon docs best practices, explicit recon should
> >>>>>>>>> start followed by implicit recon on master failover. It looks
> >>>>>>>>> like Aurora is not doing that, and the regular hourly recons are
> >>>>>>>>> running with a 30 min spread between explicit and implicit.
> >>>>>>>>> Should Aurora do recon on master failover?
> >>>>>>>>>
> >>>>>>>>> General questions:
> >>>>>>>>> 1. What is the effect on Aurora if we run explicit recon every
> >>>>>>>>> 15 mins instead of the default 1 hr? Does it slow down
> >>>>>>>>> scheduling, does snapshot creation get delayed, etc.?
> >>>>>>>>> 2. Any issue if the spread between explicit recon and implicit
> >>>>>>>>> recon is brought down to 2 mins from 30 mins? Probably depends
> >>>>>>>>> on 1.
> >>>>>>>>>
> >>>>>>>>> Thx
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>
> >>
>
>
