airavata-dev mailing list archives

From Suresh Marru <sma...@apache.org>
Subject Re: mesos and moving jobs between clusters
Date Tue, 01 Nov 2016 15:26:20 GMT

> On Nov 1, 2016, at 9:35 AM, Madhusudhan Govindaraju <mgovinda@binghamton.edu> wrote:
> 
> 
> Hello Mangirish,
> 
> Here is the text from Aurora's github page:
> 
> ---
> When and when not to use Aurora
> Aurora can take over for most uses of software ... However, if you have very specific scheduling requirements, or are building a system that looks like a scheduler itself, you may want to explore developing your own framework.
> ---
> We believe Airavata will need a framework to customize scheduling policies for different communities, and so instead of making big changes in Aurora, we want to develop our own framework.
> 
Hi Madhu, Pankaj,

New contributions and directions are very welcome. Writing a new framework as a proof of concept or as an academic effort does not need much discussion, but if the goal is to deliver a "production ready" scheduler, then I think we need significant discussion and an assessment of what the current schedulers lack. Job management has a lot of corner cases, and it will take a substantial collective effort to work on the last 20%. I suggest the following steps to make sure everyone in the community comes along and participates:

* Start with an architecture mailing list discussion on high-level goals, shortcomings in current schedulers, and why writing a new scheduler is justified over extending or contributing to existing ones. An example thread on a related topic: http://airavata.markmail.org/thread/f3ncoxyarateyn4y; another on workflows: http://markmail.org/thread/tkpbj3sr4jhg6o6z; and one on the use of ZooKeeper in Airavata: http://airavata.markmail.org/thread/sdidqqf4czprmpik.
* Once we have consensus on the architectural approach, it will be great to have a design discussion on the Airavata dev list.
* Develop the scheduler from scratch on the mailing lists, constantly seeking input and early users to try it out. The onus is on the contributor to somehow generate interest from the community.


I can understand how laborious all of this sounds, but there are many dormant observers on the dev and architecture lists, and a good topic awakens them. Airavata thrives on such volunteer intellectual contributions, which go above and beyond direct code contributions.

> Once you or Gourav Shenoy have Airavata working with Aurora/Mesos, the idea is that Pankaj will work with you to use the same codebase/task-module in Airavata to launch jobs on Mesos using a custom framework.
> 
The Thrift client for Aurora is in a working state: https://github.com/apache/airavata/tree/develop/modules/cloud/aurora-client. The integration with Airavata is on two ends. First, have the statuses pushed into the registry (this should be ready by Thursday; I plan to demo it at the Gateways workshop). Second, there may be some hard-wiring of Aurora endpoints and so on, which we need to integrate with the App Catalog; this might have to wait until we gain a better understanding of Aurora.
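
To make the first of those two ends concrete, here is a minimal sketch of the status-pushing loop. The names AuroraJobClient, RegistryClient, and JobState below are hypothetical stand-ins, not the actual interfaces in the aurora-client module or the registry API; treat it as the shape of the integration rather than the implementation:

import java.util.concurrent.TimeUnit;

public class AuroraStatusBridge {

    // Hypothetical stand-ins for the real aurora-client and registry APIs.
    enum JobState {
        PENDING, RUNNING, FINISHED, FAILED;
        boolean isTerminal() { return this == FINISHED || this == FAILED; }
    }
    interface AuroraJobClient { JobState getJobState(String jobId) throws Exception; }
    interface RegistryClient { void updateJobStatus(String jobId, String state) throws Exception; }

    private final AuroraJobClient aurora;
    private final RegistryClient registry;

    public AuroraStatusBridge(AuroraJobClient aurora, RegistryClient registry) {
        this.aurora = aurora;
        this.registry = registry;
    }

    // Poll Aurora over Thrift and mirror each observed state change into the registry.
    public void pollUntilTerminal(String jobId) throws Exception {
        JobState last = null;
        while (true) {
            JobState current = aurora.getJobState(jobId);
            if (current != last) {                    // push only on change
                registry.updateJobStatus(jobId, current.name());
                last = current;
            }
            if (current.isTerminal()) {
                return;
            }
            TimeUnit.SECONDS.sleep(30);               // polling interval
        }
    }
}

The Aurora endpoint host and port, cluster name, and similar hard-wired settings are exactly the kind of configuration that would move into the App Catalog.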

Suresh
> -Madhu
> 
> 
> On 10/28/2016 12:46 PM, Mangirish Wagle wrote:
>> Hi Pankaj,
>> 
>> I was curious to know what your motivation is for developing a custom framework rather than using Aurora or another existing, robust framework. It would be great if you could share some pointers on that.
>> I would also like to know what specific use cases you are targeting with your framework, as well as what stability concerns you have identified and how you plan to handle them.
>> 
>> Regards,
>> Mangirish
>> 
>> 
>> 
>> 
>> 
>> On Tue, Oct 25, 2016 at 6:09 PM, Pankaj Saha <psaha4@binghamton.edu> wrote:
>> Hi Mark,
>> 
>> Mesos collects the resource information from all the nodes in the cluster (cores, memory, disk, and GPUs) and presents a unified view, as if the cluster were a single operating system. Mesosphere, the commercial entity behind Mesos, has built an ecosystem around Mesos as the kernel, called the "Data Center Operating System" (DCOS). Frameworks interact with Mesos to reserve resources and then use these resources to run jobs on the cluster. So, for example, if multiple frameworks such as Marathon, Apache Aurora, and a custom MPI framework are using Mesos, then there is a negotiation between Mesos and each framework on how many resources each framework gets. Once a framework, say Aurora, gets resources, it can decide how to use them. Some of the strengths of Mesos include fault tolerance at scale and the ability to co-schedule applications/frameworks on the cluster such that cluster utilization is high.
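>> 
>> For a feel of what writing a framework involves, here is a skeletal example against the org.apache.mesos Java bindings. This is only an illustrative sketch, not our actual framework: the "policy" is a placeholder that launches one shell task on every offer with at least one free CPU and declines the rest.
>> 
>> import java.util.Collections;
>> import java.util.List;
>> import org.apache.mesos.MesosSchedulerDriver;
>> import org.apache.mesos.Scheduler;
>> import org.apache.mesos.SchedulerDriver;
>> import org.apache.mesos.Protos.*;
>> 
>> public class DemoFramework implements Scheduler {
>> 
>>     // Mesos hands us resource offers; accepting or declining them is
>>     // where a custom scheduling policy plugs in.
>>     @Override
>>     public void resourceOffers(SchedulerDriver driver, List<Offer> offers) {
>>         for (Offer offer : offers) {
>>             double cpus = 0;
>>             for (Resource r : offer.getResourcesList()) {
>>                 if (r.getName().equals("cpus")) cpus = r.getScalar().getValue();
>>             }
>>             if (cpus < 1.0) {                       // placeholder policy
>>                 driver.declineOffer(offer.getId());
>>                 continue;
>>             }
>>             TaskInfo task = TaskInfo.newBuilder()
>>                 .setName("demo-task")
>>                 .setTaskId(TaskID.newBuilder().setValue("task-" + offer.getId().getValue()))
>>                 .setSlaveId(offer.getSlaveId())
>>                 .addResources(Resource.newBuilder()
>>                     .setName("cpus")
>>                     .setType(Value.Type.SCALAR)
>>                     .setScalar(Value.Scalar.newBuilder().setValue(1.0)))
>>                 .setCommand(CommandInfo.newBuilder().setValue("echo hello from mesos"))
>>                 .build();
>>             driver.launchTasks(Collections.singletonList(offer.getId()),
>>                                Collections.singletonList(task));
>>         }
>>     }
>> 
>>     @Override
>>     public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
>>         System.out.println(status.getTaskId().getValue() + " -> " + status.getState());
>>     }
>> 
>>     // Remaining callbacks left empty for brevity.
>>     @Override public void registered(SchedulerDriver d, FrameworkID id, MasterInfo m) {}
>>     @Override public void reregistered(SchedulerDriver d, MasterInfo m) {}
>>     @Override public void offerRescinded(SchedulerDriver d, OfferID id) {}
>>     @Override public void frameworkMessage(SchedulerDriver d, ExecutorID e, SlaveID s, byte[] data) {}
>>     @Override public void disconnected(SchedulerDriver d) {}
>>     @Override public void slaveLost(SchedulerDriver d, SlaveID s) {}
>>     @Override public void executorLost(SchedulerDriver d, ExecutorID e, SlaveID s, int st) {}
>>     @Override public void error(SchedulerDriver d, String message) {}
>> 
>>     public static void main(String[] args) {
>>         FrameworkInfo framework = FrameworkInfo.newBuilder()
>>             .setUser("")                            // empty: let Mesos fill in the current user
>>             .setName("demo-framework")
>>             .build();
>>         // args[0] is the master, e.g. "host:5050" or "zk://host:2181/mesos"
>>         new MesosSchedulerDriver(new DemoFramework(), framework, args[0]).run();
>>     }
>> }
>> 
>> The resourceOffers() callback is where gateway-specific scheduling and resource negotiation would live; being able to customize that cycle is the motivation for a separate framework.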
>> 
>> Mesos off-the-shelf only works when the master and agent nodes have a line of communication to each other. We have worked on modifying the Mesos installation so that it works even when agents are behind firewalls on campus clusters. We are also working on getting the same setup to work on Jetstream and Chameleon, where allocations are a mix of public IPs and internally accessible nodes. This will allow us to use Mesos to meta-schedule across clusters. We are also developing our own framework, to be able to customize scheduling and resource negotiation for science gateways on Mesos clusters. Our plan is to work with Suresh and Marlon's team so that it works with Airavata.
>> 
>> I will be presenting at the Gateways workshop in November, and I will also be at SC along with my adviser (Madhu Govindaraju), if you would like to discuss any of these projects.
>> 
>> We are working on packaging our work so that it can be shared with this community.
>> 
>> Thanks
>> Pankaj
>> 
>> On Tue, Oct 25, 2016 at 11:36 AM, Mangirish Wagle <vaglomangirish@gmail.com> wrote:
>> Hi Mark,
>> 
>> Thanks for your question. If I understand you correctly, you need some kind of load balancing between identical clusters through a single Mesos master?
>> 
>> With the current setup, from what I understand, we have a separate Mesos master for every cluster on separate clouds. However, whether a single Mesos master can target multiple identical clusters is a good topic to investigate. We have some ongoing work to set up a virtual cluster with compute resources across clouds and install Mesos on it, but I am not sure if that is what you are looking for.
>> 
>> Regards,
>> Mangirish
>> 
>> 
>> 
>> 
>> 
>> On Tue, Oct 25, 2016 at 11:05 AM, Miller, Mark <mmiller@sdsc.edu> wrote:
>> Hi all,
>> 
>>  
>> I posed a question to Suresh (see below), and he asked me to put this question on
the dev list.
>> 
>> So here it is. I will be grateful for any comments about the issues you all are facing and what has come up in trying this, as it seems likely that this is a much simpler problem in concept than it is in practice, but its solution has many benefits.
>> 
>>  
>> Here is my question:
>> 
>> A group of us have been discussing how we might simplify submitting jobs to different compute resources in our current implementation of CIPRES, and how cloud computing might facilitate this. But none of us are cloud experts. As I understand it, the Mesos cluster that I have been seeing in the Airavata email threads is intended to make it possible to deploy jobs to multiple virtual clusters. I am (we are) wondering if Mesos manages submissions to identical virtual clusters on multiple machines, and if that works efficiently.
>> 
>>  
>> In our implementation, we have to change the rules to run efficiently on different machines, according to GPU availability and cores per node. I am wondering how Mesos/virtual clusters affect those considerations.
>> 
>> Can Mesos create basically identical virtual clusters, independent of the machine?
>> 
>> 
>> Thanks for any advice.
>> 
>>  
>> Mark
>> 
> 

