aurora-dev mailing list archives

From Hussein Elgridly <huss...@broadinstitute.org>
Subject Re: Making sense of Aurora terminal states
Date Fri, 20 Feb 2015 16:08:48 GMT
This is fantastic (and I'm glad that my understanding was mostly correct) -
thanks a lot.

Might I suggest folding this information into the user guide? Maybe it's
only relevant for my use case, but I feel like "tasks in terminal states
might be cloned and rescheduled; here's when that might happen" isn't
made as explicit as it could be. I know I'd have had an easier time if
there had been an explanation of "here's what each state means and what
might happen next", and I can imagine [weasel words; citation needed] that
other users might also find this useful.

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


On 19 February 2015 at 17:35, Bill Farner <wfarner@apache.org> wrote:

> On Thu, Feb 19, 2015 at 1:27 PM, Hussein Elgridly <
> hussein@broadinstitute.org> wrote:
>
> > I've just spent the afternoon making a flowchart out of
> > TaskStateMachine.java in an attempt to figure out what Aurora states
> > actually mean. Given that all the jobs I submit have unique names and I
> > don't permit retries, I would like to put together a set of rules that
> > determine whether a job is _really_ terminal and definitely won't be
> > rescheduled.
> >
> > Would one of the Aurora devs be willing to play a game of True or False
> > with the following statements?
> >
> > 1. If all my job names are unique and I do an aurora job status
> > --write-json, there will be at most one element in the "active" list.
> >
>
> True iff the job has only one instance.
>
>
> > 2. Jobs in the "inactive" list are ordered by last update time, most
> > recent first.
> >
>
> False.  They are sorted by instance ID [1], which doesn't make much sense.
>
> [1]
>
> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/src/main/python/apache/aurora/client/cli/jobs.py#L635-L636
>
>
> > 3. A job's "status" will always equal the status of the last item in its
> > "taskEvents" list.
> >
>
> True.
>
>
> > 4. The full list of terminal states is [LOST, FINISHED, FAILED, KILLED].
> > A job that is not in one of these states will undergo more transitions
> > and will remain in the "active" list until it gets to one of these
> > states. (Will I ever see DELETED, or do they not show up in aurora job
> > status?)
> >
>
> True.  Source of truth is [1].  We actually don't have a state [2] for
> DELETED.
>
> [1]
>
> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L410-L413
> [2]
>
> https://github.com/apache/incubator-aurora/blob/9fe6d5408d4aed113a239f22fa5c43aa4f9ae338/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L348-L380
>
>
> > 5. A job in the LOST state will always be rescheduled unless it went
> > through KILLING first. (What does this represent - killed by user and
> > then lost connectivity to the slave?)
> >
>
> True.  That is one way it could happen; it could also happen if the
> scheduler times the task out while waiting to hear back from Mesos after
> attempting to kill the task.
>
>
> > 6. A job will be rescheduled if it goes through one of [RESTARTING,
> > DRAINING, PREEMPTING].
> >
>
> True.
>
>
> > 7. Assuming maxTaskFailures = 1, #5 and #6 are the ONLY situations in
> > which a job will be rescheduled.
> >
>
> True.
>
>
> > 8. These rules are unlikely to change in the future ;)
> >
>
> True, though we could add more states, which would invalidate (4) and (6).
> In practice, we have changed the states and their meanings very little in
> ~5 years.
>
>
> > Finally, I noticed something odd: ASSIGNED -> LOST has followups [KILL,
> > RESCHEDULE], but STARTING and RUNNING -> LOST only has [RESCHEDULE] as a
> > followup. Why?
> >
>
> This is because ASSIGNED -> LOST may mean that there was a race between
> creating the task and Aurora timing out the launch (it may not have heard
> back from mesos).  To reduce the likelihood of a redundant instance, we try
> to proactively kill the race.  The RUNNING state does not time out, so we
> do not have the same concern there.
>
>
> > Thanks,
> > Hussein Elgridly
> > Senior Software Engineer, DSDE
> > The Broad Institute of MIT and Harvard
> >
>
