aurora-dev mailing list archives

From "Erb, Stephan" <Stephan....@blue-yonder.com>
Subject Re: Graceful task shutdown
Date Tue, 07 Apr 2015 20:28:05 GMT
Brian, do you have any particular plans regarding your shutdown requirements? I saw that
you filed another issue [1] that is also concerned with graceful shutdown.

Stephan

PS: For what it's worth, I implemented the 'quick fix' for the problem I described at the
beginning of this thread [2].

[1] https://issues.apache.org/jira/browse/AURORA-1257
[2] https://reviews.apache.org/r/32889/
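For readers following along, option (a) from the original message further down this thread can be sketched roughly as follows. This is not the code from the review request [2]. It is a minimal, self-contained illustration that assumes a signaler reporting whether each lifecycle request was actually delivered. `terminate_http`, the `(delivered, error)` return shape and the 5-second per-stage wait are assumptions for illustration, not Aurora's actual API.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical signaler: takes an endpoint name ("quitquitquit" or
# "abortabortabort") and returns (delivered, error_message).
Signaler = Callable[[str], Tuple[bool, Optional[str]]]

# Assumed per-stage wait; two stages would add up to the 10 s delay.
ESCALATION_WAIT_SECS = 5.0


def terminate_http(
    signal: Signaler,
    is_finished: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
    wait_secs: float = ESCALATION_WAIT_SECS,
) -> bool:
    """Escalate through the HTTP lifecycle endpoints, waiting only after a
    request was actually delivered. Returns True if the task finished."""
    for endpoint in ("quitquitquit", "abortabortabort"):
        delivered, _error = signal(endpoint)
        if not delivered:
            # The service does not implement this endpoint (or is
            # unreachable): skip the wait instead of sleeping for nothing.
            continue
        sleep(wait_secs)
        if is_finished():
            return True
    # Caller falls through to signal-based termination (SIGTERM etc.).
    return False
```

A service that only implements /health would return `(False, ...)` for both endpoints, so the function returns immediately and the caller can move straight on to signal-based termination instead of waiting 10 seconds.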

________________________________________
From: Brian Brazil <brian.brazil@boxever.com>
Sent: Tuesday, March 24, 2015 10:48 PM
To: dev@aurora.incubator.apache.org
Subject: Re: Graceful task shutdown

On 24 March 2015 at 21:33, George Sirois <george@tellapart.com> wrote:

> Unfortunately I don't think my change will be able to make it in as-is.
>
> As Brian Wickman pointed out, it could introduce serious problems because
> there are varying timeouts across the scheduler/executor, so if you set
> your wait time to be too high, the scheduler might start to consider the
> tasks lost because they stayed in the transient KILLING state for too long.
>

Hmm, what sort of work is involved in resolving that?

In my case I need at least 12s after the /quitquitquit before sending the SIGTERM.

Brian
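One way to frame the conflict George describes: every configurable wait adds to the time a task spends in the transient KILLING state, so the sum of all escalation waits has to stay below whatever timeout the scheduler applies to transient states. The sketch below makes that check explicit. `ShutdownConfig`, its field names, the SIGTERM wait and the timeout values are all hypothetical; they are not Aurora's real configuration fields or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutdownConfig:
    # Hypothetical per-task settings; these are not existing Aurora fields.
    quit_wait_secs: float = 5.0   # wait after /quitquitquit
    abort_wait_secs: float = 5.0  # wait after /abortabortabort
    term_wait_secs: float = 0.0   # assumed wait after SIGTERM before SIGKILL

    def worst_case_secs(self) -> float:
        """Longest time the task can spend shutting down."""
        return self.quit_wait_secs + self.abort_wait_secs + self.term_wait_secs


def validate(config: ShutdownConfig, transient_timeout_secs: float) -> None:
    """Reject configs whose shutdown could outlast the scheduler's (assumed)
    timeout for a task stuck in a transient state such as KILLING."""
    worst = config.worst_case_secs()
    if worst >= transient_timeout_secs:
        raise ValueError(
            f"shutdown may take {worst:.0f}s, but the scheduler gives up "
            f"after {transient_timeout_secs:.0f}s in a transient state"
        )
```

With a 12-second wait after /quitquitquit and the default 5-second abort stage, `ShutdownConfig(quit_wait_secs=12.0).worst_case_secs()` is 17 seconds; that passes against a hypothetical 60-second scheduler timeout but would be rejected against a 15-second one. Resolving the issue properly would mean either validating like this or making the scheduler's timeout aware of the configured waits.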


>
> I do think the lifecycle modules idea would solve Stephan's issue.
>
> On Tue, Mar 24, 2015 at 5:06 PM, Brian Brazil <brian.brazil@boxever.com>
> wrote:
>
> > On 24 March 2015 at 20:57, Erb, Stephan <Stephan.Erb@blue-yonder.com>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > we are implementing the /health endpoint in our services but omit the
> > > implementation of the unauthenticated lifecycle methods /quitquitquit and
> > > /abortabortabort.
> > >
> > > As a consequence, stopping a service is delayed by 10 seconds of waiting
> > > time [1]. I would like to get rid of this unnecessary delay and can think
> > > of two solutions:
> > >
> > > a) Only perform the escalation wait when the http_signaler reports that
> > > the message could be delivered to the service. This is a rather simple and
> > > localized fix.
> > >
> > > b) Use another port for lifecycle events. This would require a new
> > > addition to the task configuration and proper plumbing throughout the rest
> > > of the system. Backward compatibility could be achieved by using 'health'
> > > as the default lifecycle management port.
> > >
> > > Any thoughts? I would be happy with the simple solution, but in the end
> > > it's your call :-)
> > >
> >
> > __george mentioned on IRC that he is working on a change that'll make the
> > wait time configurable (which is something I also need). Would that cover
> > your use case?
> >
> > There were also discussions on IRC about custom lifecycle modules.
> >
> > Brian
> >
> >
> > >
> > > Best Regards,
> > > Stephan
> > >
> > > [1] https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/executor/thermos_task_runner.py#L123
> >
>
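Option (b) from Stephan's original message mostly comes down to choosing which port the executor signals. A minimal sketch, assuming the task's allocated ports are available as a name-to-number mapping and that a new, hypothetical `lifecycle_port_name` setting is added to the task configuration; neither that setting nor the `lifecycle_port` helper exists in Aurora.

```python
from typing import Mapping, Optional

# Backward-compatible default suggested in option (b).
DEFAULT_LIFECYCLE_PORT = "health"


def lifecycle_port(
    ports: Mapping[str, int], lifecycle_port_name: Optional[str] = None
) -> Optional[int]:
    """Return the port to send /quitquitquit and /abortabortabort to, or None
    if the task exposes no such port (skip HTTP shutdown entirely)."""
    # Hypothetical setting: fall back to 'health' when it is not configured.
    name = lifecycle_port_name or DEFAULT_LIFECYCLE_PORT
    return ports.get(name)
```

Existing jobs that only allocate a `health` port keep today's behaviour, while a job that sets a separate lifecycle port (or none at all) can avoid the escalation waits for endpoints it never implemented. The larger cost Stephan mentions is the plumbing: the new setting has to travel from the job configuration through the scheduler to the executor.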