aurora-dev mailing list archives

From Mangirish Wagle <vaglomangir...@gmail.com>
Subject Re: Need inputs on scheduling
Date Tue, 18 Oct 2016 04:43:05 GMT
Hi Stephan,

Thank you very much for those insights. So if I understand it correctly,
the idea here is that the MPI job would be distributed across multiple
Aurora job instances, instead of multiple machines. Also, all instances of
the MPI job should be scheduled together as one entity (gang scheduling).
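
For illustration only, here is a minimal sketch of the all-or-nothing
placement that gang scheduling implies. It is not Aurora or Mesos code: the
function name gang_schedule, the free_cpus map, and the per-instance CPU
requirement are hypothetical names chosen for this example, and a real
scheduler would also consider memory, ports, constraints, and fairness.

```python
from typing import Dict, List, Optional


def gang_schedule(
    free_cpus: Dict[str, float],
    instances: int,
    cpus_per_instance: float,
) -> Optional[List[str]]:
    """Return one host name per instance if ALL instances fit, else None.

    Hypothetical helper: nothing is placed unless the whole gang fits.
    """
    if instances <= 0:
        return []
    placement: List[str] = []
    # Visit hosts in a fixed (sorted) order so the result is deterministic.
    for host, free in sorted(free_cpus.items()):
        # Number of instances this host can hold.
        slots = int(free // cpus_per_instance)
        take = min(slots, instances - len(placement))
        placement.extend([host] * take)
        if len(placement) == instances:
            return placement
    # Not enough capacity for the entire gang: place nothing.
    return None
```

With hosts {"a": 4, "b": 2} and 2 CPUs per instance, a 3-instance job is
placed in full, while a 4-instance job is rejected as a whole rather than
started partially.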

One of the Mesos developers pointed me to a gang scheduler
implementation: https://github.com/nqn/gasc
As I understand it, this gang-schedules an MPI job directly on Mesos. I
need to understand how advantageous it would be to have Aurora as a backend
for gang scheduling instead of bare Mesos. One advantage is that Aurora is
a proven, robust and fault-tolerant framework on Mesos, whereas the
latter approach would require implementing those properties ourselves.

Please let me know if you have any more thoughts.

Thanks and Regards,
Mangirish Wagle

On Sun, Oct 16, 2016 at 1:21 PM, Stephan Erb <serb@apache.org> wrote:

> I used MPI briefly a couple of years ago, and from what I
> remember:
>
> MPI tends to require so-called gang scheduling, where all instances of
> a job are scheduled simultaneously. Because MPI lacks inherent fault
> tolerance, it is common to abort the entire job (i.e. all instances)
> if a single instance fails. Furthermore, native MPI/HPC schedulers
> tend to support long queues with various fairness mechanisms in order
> to make the gang scheduling efficient.
>
> In contrast, Aurora makes the assumption that individual instances of a
> job can be scheduled and fail independently. This implies that you
> would need some external scaffolding to ensure proper gang scheduling.
> (Disclaimer: I have no idea how difficult this would be)
>
> Aurora is battle-tested. Using it as the backend of an HPC/MPI
> scheduler could therefore be worthwhile if you manage to make the
> scaffolding work, in particular because writing a scalable and
> fault-tolerant Mesos framework can be quite difficult.
>
> Best Regards,
> Stephan
>
>
> On Sat, 2016-10-15 at 12:47 -0400, Mangirish Wagle wrote:
> > Hi Santhosh,
> >
> > Thanks for your response and suggestion. Mesos-hydra is no longer
> > used or supported by the community, from what I heard from Mesos
> > developers. But it may certainly be a useful reference to build
> > upon.
> >
> > My most preferred option would be to use an existing scheduler like
> > Apache Aurora to run MPI. If you have any insights on that, that
> > would be really helpful.
> >
> > Regards,
> > Mangirish
> >
> > On Sat, Oct 15, 2016 at 11:07 AM, Santhosh Kumar Shanmugham <
> > sshanmugham@twitter.com.invalid> wrote:
> >
> > >
> > > Have you checked out https://github.com/mesosphere/mesos-hydra?
> > >
> > > On Oct 14, 2016 6:08 PM, "Mangirish Wagle" <vaglomangirish@gmail.com>
> > > wrote:
> > >
> > > >
> > > > Thanks for your response, Zameer. I shall check out Apache Aurora
> > > > and report back on whether it serves the purpose.
> > > >
> > > > On Fri, Oct 14, 2016 at 2:01 PM, Zameer Manji <zmanji@apache.org>
> > > > wrote:
> > > >
> > > > >
> > > > > Hey,
> > > > >
> > > > > I am not an expert on MPI jobs, but it seems possible to run
> > > > > them on Aurora. Aurora is a pretty flexible scheduler that lets
> > > > > you run arbitrary binaries or container images. Aurora is
> > > > > designed for long running services and assuming that you want
> > > > > to launch workers that are long running, it could solve your
> > > > > problem.
> > > > >
> > > > > On Thu, Oct 13, 2016 at 11:12 PM, Mangirish Wagle <
> > > > > vaglomangirish@gmail.com>
> > > > > wrote:
> > > > >
> > > > > >
> > > > > > Hello Aurora Devs,
> > > > > >
> > > > > > I am contributing to Apache Airavata <http://airavata.apache.org/>
> > > > > > and currently working on extending the support for the science
> > > > > > gateways to run MPI jobs on cloud based Mesos clusters.
> > > > > >
> > > > > > Is there a way I can achieve this using Apache Aurora? I
> > > > > > would really appreciate it if you could share info on any work
> > > > > > already being done to schedule MPI jobs on Mesos.
> > > > > >
> > > > > > Thank you.
> > > > > >
> > > > > > Best Regards,
> > > > > > Mangirish Wagle
> > > > > > Graduate Student, Indiana University Bloomington
> > > > > >
> > > > > > --
> > > > > > Zameer Manji
> > > > > >
> > > > >
> > > >
> > >
>
