aurora-dev mailing list archives

From Zameer Manji <zma...@apache.org>
Subject Re: schedule task instances spreading them based on a host attribute.
Date Thu, 30 Mar 2017 18:35:08 GMT
Rick,

Can you share why it would be nice to spread these different jobs across
different hosts? Is it for reliability, performance, utilization, etc.?

On Thu, Mar 30, 2017 at 11:31 AM, Rick Mangi <rick@chartbeat.com> wrote:

> Yeah, we have a dozen or so Kafka consumer jobs running in our cluster,
> each with about 40 instances.
>
>
> > On Mar 30, 2017, at 2:06 PM, David McLaughlin <david@dmclaughlin.com> wrote:
> >
> > There is absolutely a need for custom hook points in the scheduler
> > (injecting default constraints to running tasks for example). I don't
> > think users should be asked to write custom scheduling algorithms to
> > solve the problems in this thread though. There are also huge downsides
> > to exposing the internals of scheduling as a part of a plugin API.
> >
> > Out of curiosity, do your Kafka consumers span multiple jobs? Otherwise
> > host constraints solve that problem, right?
> >
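[For reference, the host constraint David mentions here is expressed in the
job configuration itself. A minimal sketch in the Aurora DSL, assuming the
constraint syntax from the Aurora docs and purely hypothetical names
(devcluster, www-data, hello_world):

    hello = Process(name = 'hello', cmdline = 'echo hello && sleep 60')

    task = SequentialTask(
      processes = [hello],
      resources = Resources(cpu = 0.1, ram = 32*MB, disk = 32*MB))

    jobs = [
      Service(
        cluster = 'devcluster',     # hypothetical cluster name
        role = 'www-data',          # hypothetical role
        environment = 'prod',
        name = 'hello_world',
        instances = 4,
        task = task,
        # At most one instance of this job per value of the 'host'
        # attribute, i.e. one instance per machine.
        constraints = {'host': 'limit:1'})
    ]

Because the limit is evaluated per job, it spreads a single job's instances
across hosts but says nothing about other jobs' instances on the same box.]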
> >> On Mar 30, 2017, at 10:34 AM, Rick Mangi <rick@chartbeat.com> wrote:
> >>
> >> I think the complexity is a great rationale for having a pluggable
> >> scheduling layer. Aurora is very flexible and people use it in many
> >> different ways. Giving users more flexibility in how jobs are scheduled
> >> seems like it would be a good direction for the project.
> >>
> >>
> >>> On Mar 30, 2017, at 12:16 PM, David McLaughlin <dmclaughlin@apache.org> wrote:
> >>>
> >>> I think this is more complicated than multiple scheduling algorithms.
> >>> The problem you'll end up having if you try to solve this in the
> >>> Scheduling loop is when resources are unavailable because there are
> >>> preemptible tasks running in them, rather than hosts being down. Right
> >>> now the fact that the task cannot be scheduled is important because it
> >>> triggers preemption and will make room. An alternative algorithm that
> >>> tries at all costs to schedule the task in the TaskAssigner could
> >>> decide to place the task in a non-ideal slot and leave a preemptible
> >>> task running instead.
> >>>
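[A toy, heavily simplified sketch of the trade-off David describes; this is
not Aurora's actual TaskAssigner code, and all names are hypothetical.
Vetoing every unsuitable offer leaves the task pending, which is what lets
preemption make room; a "schedule at all costs" variant always places the
task, so preemption never fires and the placement can be non-ideal:

    def strict_assign(task, offers):
        # Veto every offer that fails the task's constraint. Returning None
        # leaves the task PENDING, which is what allows preemption to free
        # up an ideal slot.
        for offer in offers:
            if offer['free_cpu'] >= task['cpu'] and offer['rack'] == task['preferred_rack']:
                return offer
        return None

    def best_effort_assign(task, offers):
        # Relax the constraint instead of failing. The task always lands
        # somewhere, so preemption is never triggered and a lower-priority,
        # preemptible task may keep the ideal slot.
        fitting = [o for o in offers if o['free_cpu'] >= task['cpu']]
        preferred = [o for o in fitting if o['rack'] == task['preferred_rack']]
        return (preferred or fitting or [None])[0]
]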
> >>> It's also important to think of the knock-on effects here when we move
> >>> to offer affinity (i.e. the current Dynamic Reservation proposal). If
> >>> you've made this non-ideal compromise to get things scheduled - that
> >>> decision will basically be permanent until the host you're on goes
> >>> down. At least with how things work now, with each scheduling attempt
> >>> the job has a fresh chance of being put in an ideal slot.
> >>>
> >>>> On Thu, Mar 30, 2017 at 8:12 AM, Rick Mangi <rick@chartbeat.com> wrote:
> >>>>
> >>>> Sorry for the late reply, but I wanted to chime in here as someone who
> >>>> wants to see this feature. We run a medium-sized cluster (around 1000
> >>>> cores) in EC2, and I think we could get better usage of the cluster
> >>>> with more control over the distribution of job instances. For example,
> >>>> it would be nice to limit the number of Kafka consumers running on the
> >>>> same physical box.
> >>>>
> >>>> Best,
> >>>>
> >>>> Rick
> >>>>
> >>>>
> >>>>> On 2017-03-06 14:44 (-0400), Mauricio Garavaglia <m...@gmail.com> wrote:
> >>>>> Hello!
> >>>>>
> >>>>> I have a job that has multiple instances (>100) that I'd like to
> >>>>> spread across the hosts in a cluster. Using a constraint such as
> >>>>> "limit=host:1" doesn't work quite well, as I have more instances
> >>>>> than nodes.
> >>>>>
> >>>>> As a workaround I increased the limit value to something like
> >>>>> ceil(instances/nodes). But now the problem happens if a bunch of
> >>>>> nodes go down (think a whole rack dies), because the instances will
> >>>>> not run until they are back, even though we may have spare capacity
> >>>>> on the rest of the hosts that we'd like to use. In that scenario, the
> >>>>> job availability may be affected because it's running with fewer
> >>>>> instances than expected. On a smaller scale, the same approach would
> >>>>> also apply if you want to spread tasks across racks or availability
> >>>>> zones. I'd like to have one instance of a job per rack (failure
> >>>>> domain), but if a rack goes down, the instance can be spawned on a
> >>>>> different rack.
> >>>>>
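[For concreteness, the workaround described above amounts to something like
the following, in plain Python with made-up instance and node counts:

    import math

    instances = 120   # hypothetical: a job with >100 instances
    nodes = 50        # hypothetical: schedulable hosts in the cluster

    # Allow up to ceil(instances/nodes) instances per host so the job fits.
    limit = math.ceil(instances / nodes)          # -> 3
    constraints = {'host': 'limit:%d' % limit}    # e.g. {'host': 'limit:3'}

    # The downside described above: if many nodes disappear, any instances
    # beyond the per-host limit stay PENDING even though the surviving hosts
    # have spare capacity.
]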
> >>>>> I thought we could have a scheduling constraint to "spread" instances
> >>>>> across a particular host attribute; instead of vetoing an offer right
> >>>>> away, we check where the other instances of a task are running,
> >>>>> looking at a particular attribute of the host. We try to maximize the
> >>>>> number of distinct values of a particular attribute (rack, hostname,
> >>>>> etc.) across the task's instance assignments.
> >>>>>
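[One way to read that proposal, sketched in plain Python with hypothetical
data structures (this is not existing Aurora scheduler code): count how many
instances of the job already run on each value of the chosen attribute and
prefer the offer whose value is least loaded, rather than vetoing it outright:

    from collections import Counter

    def pick_offer(offers, running_instances, attribute='rack'):
        # How many instances already run on each value of the attribute.
        counts = Counter(inst[attribute] for inst in running_instances)
        # Prefer the offer whose attribute value currently hosts the fewest
        # instances; this spreads instances across racks/hosts/zones without
        # hard-failing when every value is already in use.
        return min(offers, key=lambda offer: counts[offer[attribute]])

    # Example: rack 'a' already runs two instances, rack 'b' runs one, so
    # the next instance goes to the offer from rack 'b'.
    running = [{'rack': 'a'}, {'rack': 'a'}, {'rack': 'b'}]
    offers = [{'host': 'h1', 'rack': 'a'}, {'host': 'h2', 'rack': 'b'}]
    print(pick_offer(offers, running))   # -> {'host': 'h2', 'rack': 'b'}
]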
> >>>>> What do you think? Did something like this come up in the past? Is
> >>>>> it feasible?
> >>>>>
> >>>>>
> >>>>> Mauricio
> >>>>>
> >>>>
> >>
>
> --
> Zameer Manji
>
