aurora-dev mailing list archives

From Anindya Sinha <anindya.si...@gmail.com>
Subject Re: Handling of aurora update when job/task cannot be scheduled
Date Tue, 20 May 2014 16:03:48 GMT
> >
> > When an aurora create is done and one or more instances are in PENDING
> > state, those instances may never get scheduled for the reasons you
> > mentioned in point 1
>
> Fair point. However, creating a job with a large number of instances coming
> up all at the same time is undesirable for the reasons I mentioned earlier
> (backend pressure). A common practice is to create a job with a few
> instances and then use 'aurora update' to gradually scale up the number of
> instances. We have had a request to let 'aurora create' support a --shards
> (--instances) option, but given the existing workaround it was
> de-prioritized. Feel free to create a ticket if you think otherwise.
>

[AS] Agreed on the recommendation. However, for this to work we rely on
users to adopt best practices, as opposed to the software having the knobs
to address this problem of performance and scale/latency. The problem is
not specific to a single aurora create with a large # of instances; it
could also arise from multiple aurora creates (across multiple jobs)
within a short period of time.
Having some mechanism to throttle a spike in requests would be a desirable
enhancement to aurora, in my opinion.
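To illustrate the kind of throttling I have in mind, here is a minimal
sketch (plain Python; the class and method names are hypothetical and not
part of aurora) of a token-bucket limiter that smooths launch spikes by
leaving excess tasks PENDING until tokens accumulate:

```python
import time

class ScheduleThrottler:
    """Hypothetical token-bucket limiter for task launches.

    Allows at most `rate` launches per second on average, with bursts of
    up to `burst`. Launches beyond that stay queued (PENDING) until
    tokens accumulate, smoothing out spikes caused by many
    near-simultaneous 'aurora create' calls.
    """

    def __init__(self, rate, burst):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # bucket capacity
        self.tokens = float(burst)   # start with a full bucket
        self.last = time.monotonic()

    def try_launch(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # launch the task now
        return False      # leave the task PENDING for now

throttler = ScheduleThrottler(rate=5, burst=2)  # ~5 launches/sec, bursts of 2
if throttler.try_launch():
    pass  # hand the task over to the scheduler
```

The same bucket could be keyed per-cluster or per-role; that policy choice
is exactly the part I think deserves its own discussion.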


>
> > Currently, aurora create is asynchronous whereas aurora update is
> > synchronous, which leads to inconsistent behavior in how instances
> > are scheduled in create vs update.
>
> I don't necessarily see the immediate benefits of having consistency here.
> The 'aurora create' does not carry the same contractual guarantees as
> 'aurora update' simply because its semantics do not imply service
> interruption. BTW, there is a '--wait_until' option that you can provide to
> 'aurora create' to make it feel more synchronous.
>

[AS] Actually, I prefer the asynchronous nature of aurora create, since the
actual scheduling is handled by the infrastructure. This discussion is
primarily aimed at providing an option to make aurora update asynchronous
as well.


> > Any failure should
> > kick in rollback (unless rollback=False), but if a new instance is in
> > PENDING, or an old instance that was in PENDING is still in PENDING after
> > the update, we should not consider it a failure nor terminate the whole
> > update operation, as is the case today.
>
> This just shifts the problem back to the backend performance domain.
>
> Perhaps I should have asked this first: what is the exact problem that you
> are trying to solve? Are you concerned about the update prematurely
> terminating your rollout before the instances exit PENDING? Have you
> considered playing with UpdateConfig values
> (http://aurora.incubator.apache.org/documentation/latest/configuration-reference/#updateconfig-objects)
> to finely tune your update procedure? Specifically, increasing the
> restart_threshold will give you more time to account for the delayed
> scheduling.
>

[AS] Yes, the problem I am trying to tackle is to not fail the update when
one or more instances cannot be scheduled within
UpdateConfig.restart_threshold seconds. Increasing
UpdateConfig.restart_threshold certainly alleviates the problem, but it is
not foolproof, since whatever value we predict may not be appropriate with
respect to the cluster state at any given point in time.
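For concreteness, the tuning Maxim suggests lives in the UpdateConfig block
of the job config. A rough .aurora fragment (the field names are from the
configuration reference linked above; the values are illustrative only, not
recommendations):

```python
# Sketch of an .aurora job config fragment (illustrative values only).
# Increasing restart_threshold gives PENDING instances more time to be
# scheduled before the updater counts them as failed.
update_config = UpdateConfig(
  batch_size = 1,              # update one instance at a time
  restart_threshold = 300,     # secs to wait for an instance to reach RUNNING
  watch_secs = 45,             # secs an instance must stay RUNNING to pass
  max_per_shard_failures = 2,
  rollback_on_failure = True,
)
```

But as noted above, any fixed restart_threshold is a guess about future
cluster capacity, which is why I am proposing a behavioral change instead.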


> On Sun, May 18, 2014 at 10:52 PM, Anindya Sinha
> <anindya.sinha@gmail.com> wrote:
>
> > Hi
> >
> > Thanks for your feedback. Really appreciate it.
> > My responses inline starting with [AS].
> >
> > Thanks
> > Anindya
> >
> >
> > > - The task in PENDING state might never get scheduled due to a variety
> > > of reasons, e.g. unsatisfied constraints, unreasonable resource
> > > requirements, etc. Furthermore, if the task eventually gets scheduled,
> > > it may never reach RUNNING, or more likely fail repeatedly and get
> > > THROTTLED for flapping. Neither can be considered a successful update
> > > criterion.
> > >
> > > - Assuming PENDING tasks are only blocked by the lack of resources and
> > > get unblocked eventually, having hundreds or thousands of PENDING tasks
> > > transitioning to RUNNING at the same time may, and most likely will,
> > > result in unpredictable performance problems on the package retrieval
> > > (e.g. HDFS) or application side (e.g. backend connections/load).
> > >
> > [AS] Agreed on the use case for both of the first 2 points. However,
> > those scenarios exist anyway for "aurora create", where it actually keeps
> > the instances that cannot be executed immediately in PENDING state (and
> > does not terminate them).
> > When an aurora create is done and one or more instances are in PENDING
> > state, those instances may never get scheduled for the reasons you
> > mentioned in point 1. Furthermore, if a large number of such instances
> > get scheduled at the same time (in the future), you could run into
> > unpredictable latency/performance issues.
> >
> > Currently, aurora create is asynchronous whereas aurora update is
> > synchronous, which leads to inconsistent behavior in how instances
> > are scheduled in create vs update. I think providing an additional
> > mechanism for the behavior of create and update to be consistent would
> > certainly help address the "unexpected failures" due to PENDING instances
> > in an aurora update. I think that using rollback=False is undesirable due
> > to the inconsistent handling, based on my earlier example.
> >
> > Regarding performance/latency concerns, there could be a throttling
> > mechanism built in based on the # of instances (or based on some other
> > policy) that are launched in a given period of time (just a thought).
> > This problem is not specific to create or update but a generic one that
> > needs to be addressed separately.
> >
> > > - A mixed update (updating existing instances and adding new ones) may
> > > result in a degraded service state where some or all instances may be
> > > killed with no replacement coming online (i.e. a new update config has
> > > a resource bump that cannot be satisfied). In the worst case, this may
> > > result in a complete service outage.
> > >
> >
> > [AS] This is indeed a good point. I think that this modification in
> > behavior should apply only while adding new instances or for existing
> > instances not in RUNNING state. If the update to existing instances in
> > RUNNING state fails, I think the current implementation of rollback
> > should kick in, since otherwise it might lead to service degradation or
> > outage.
> >
> > Essentially, aurora update should ensure that any existing instances in
> > RUNNING state continue to be in RUNNING state. Any failure should
> > kick in rollback (unless rollback=False), but if a new instance is in
> > PENDING, or an old instance that was in PENDING is still in PENDING after
> > the update, we should not consider it a failure nor terminate the whole
> > update operation, as is the case today.
> >
> >
> > > Thanks,
> > > Maxim
> > >
> > >
> > > On Wed, May 14, 2014 at 11:01 PM, Anindya Sinha
> > > <anindya.sinha@gmail.com> wrote:
> > >
> > > > Hi
> > > >
> > > > Wanted to propose a modification in the handling of aurora update
> > > > when the job or a task cannot be scheduled immediately, based on my
> > > > understanding of job scheduling within aurora.
> > > > Please feel free to share your comments and/or concerns.
> > > >
> > > > Thanks
> > > > Anindya
> > > >
> > > > *Scenario*
> > > > Assume we have a job with 2 RUNNING instances (say instances 0 and
> > > > 1) in the cluster, and then "aurora update" is issued on the same
> > > > job key, bumping the instance count up to, say, 5. By default, it
> > > > keeps instances 0 and 1 intact, attempts to launch 3 additional
> > > > instances, and waits up to UpdateConfig.watch_secs for each
> > > > instance to be in RUNNING state before moving on.
> > > >
> > > > Assume the cluster is in a state where only 1 additional instance
> > > > can be launched due to resource unavailability. Hence, instance 2
> > > > is executed (is in RUNNING state) and instance 3 moves to PENDING
> > > > state; when UpdateConfig.restart_threshold expires, the updater
> > > > deems instance 3 to be a failed instance.
> > > > If UpdateConfig.rollback_on_failure is True (the default), it rolls
> > > > back the changes done in the update and terminates instances 2 and
> > > > 3.
> > > > If UpdateConfig.rollback_on_failure is False, it does a NOP,
> > > > keeping instances 0 through 2 in RUNNING and instance 3 in PENDING.
> > > > Instance 4 is never attempted in either of the scenarios.
> > > >
> > > > *Proposal*
> > > > I propose that in aurora update, we should consider an instance
> > > > still in PENDING state after the UpdateConfig.restart_threshold
> > > > timeout NOT to be a failed case (and keep it in PENDING state). The
> > > > reasoning is that instances which could not be scheduled at the
> > > > time of the aurora update can eventually be scheduled once a host
> > > > in the cluster becomes available to run them (based on resource
> > > > availability in the future).
> > > >
> > > > In the current approach, instance 4 is not even attempted to be
> > > > scheduled, since instance 3 is considered to be a failure. Further,
> > > > the scheduling of jobs within aurora update should ideally be
> > > > treated similarly to aurora create (since in the case of an aurora
> > > > create with instance count=5, we would have 3 RUNNING instances and
> > > > 2 instances in PENDING state, assuming the cluster is in a similar
> > > > state).
> > > >
> > > > UpdateConfig.rollback_on_failure=False does not address the above
> > > > use case in all scenarios, since:
> > > > a) It works if the PENDING instance is the last instance to be
> > > > launched, but fails if there are additional instances to be
> > > > launched (as in the example above).
> > > > b) It disables rollback, which may not be desirable for "real"
> > > > failures to launch tasks in the cluster.
> > > >
> > > > Here is a JIRA that references this issue (which contains the same
> > > > details as in this email):
> > > > Reference: https://issues.apache.org/jira/browse/AURORA-413
> > > >
> > >
> >
>
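To make the difference between the current and proposed behavior concrete,
here is a small sketch (plain Python; the function and state names are
hypothetical, not taken from the aurora code base) of the per-instance
verdict once restart_threshold expires:

```python
# Hypothetical sketch of the updater's per-instance verdict after
# restart_threshold expires. Names do not come from aurora; this only
# illustrates the proposed behavioral change.

def verdict(state, proposed=False):
    """Classify an instance once restart_threshold has expired."""
    if state == "RUNNING":
        return "ok"
    if state == "PENDING" and proposed:
        # Proposed: a still-PENDING instance is not a failure; leave it
        # for the scheduler to place once resources free up.
        return "leave_pending"
    # Current behavior: anything not RUNNING (including PENDING) fails
    # the update and, with rollback_on_failure=True, triggers rollback.
    return "failed"

# Scenario from the mail: instances 2..4 added, but the cluster has room
# for only one more task, so instances 3 and 4 sit in PENDING.
states = {2: "RUNNING", 3: "PENDING", 4: "PENDING"}
current = {i: verdict(s) for i, s in states.items()}
proposed = {i: verdict(s, proposed=True) for i, s in states.items()}
```

Under the current rules instance 3 fails the whole update; under the
proposal, instances 3 and 4 simply stay PENDING while genuinely failed
tasks still trigger rollback.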
