airavata-dev mailing list archives

From "Pierce, Marlon" <marpi...@iu.edu>
Subject Re: Running MPI jobs on Mesos based clusters
Date Thu, 13 Oct 2016 16:13:50 GMT
BSD is ok: https://www.apache.org/legal/resolved. 

 

From: Mangirish Wagle <vaglomangirish@gmail.com>
Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Date: Thursday, October 13, 2016 at 12:03 PM
To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Subject: Re: Running MPI jobs on Mesos based clusters

 

Hello Devs,

I needed some advice on the license of the MPI libraries. The MPICH library that I have been
trying claims to have a "BSD Like" license (http://git.mpich.org/mpich.git/blob/HEAD:/COPYRIGHT).

I am aware that OpenMPI, which uses a BSD license, is currently used in our application. I
chose to start investigating MPICH because it claims to be a highly portable, high-quality
implementation of the latest MPI standard, suitable for cloud-based clusters.

If anyone could advise on whether the MPICH library's "BSD Like" license is acceptable for
the ASF, that would help.

Thank you.

Best Regards,

Mangirish Wagle

 

On Thu, Oct 6, 2016 at 1:48 AM, Mangirish Wagle <vaglomangirish@gmail.com> wrote:

Hello Devs, 

 

The network issue mentioned above is now resolved. The problem was that iptables had some
conflicting rules that blocked the traffic. It was resolved by a simple iptables flush.

 

Here is the test MPI program running on multiple machines:-

 

[centos@mesos-slave-1 ~]$ mpiexec -f machinefile -n 2 ./mpitest

Hello world!  I am process number: 0 on host mesos-slave-1

Hello world!  I am process number: 1 on host mesos-slave-2

 

The next step is to try invoking this through a framework like Marathon. However, the job
submission still does not run through Marathon. It seems to get stuck in the 'waiting' state
forever (for example, http://149.165.170.245:8080/ui/#/apps/%2Fmaw-try). Further, I notice that
Marathon is listed under 'inactive frameworks' in the Mesos dashboard (http://149.165.171.33:5050/#/frameworks).
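
One way to confirm what the dashboard shows is to read the Mesos master's state document and list the frameworks it does not consider active. The following is a minimal sketch in Python, not a definitive diagnostic. It assumes the master serves JSON at `/state` with a `frameworks` list whose entries carry `name` and `active` fields; the helper names `fetch_state` and `inactive_frameworks` are hypothetical, and the sample dictionary is made up for illustration rather than taken from this cluster.

```python
import json
import urllib.request
from typing import Any, Dict, List


def fetch_state(master_url: str) -> Dict[str, Any]:
    """Download the master's state document (assumed endpoint: /state)."""
    url = master_url.rstrip("/") + "/state"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def inactive_frameworks(state: Dict[str, Any]) -> List[str]:
    """Return names of registered frameworks whose 'active' flag is not true."""
    return [
        fw.get("name", "<unnamed>")
        for fw in state.get("frameworks", [])
        if not fw.get("active", False)
    ]


# Illustrative sample only; field names are assumptions, values are invented.
sample_state = {
    "frameworks": [
        {"name": "marathon", "active": False},
        {"name": "another-framework", "active": True},
    ]
}
print(inactive_frameworks(sample_state))
```

Against the cluster, the equivalent call would be `inactive_frameworks(fetch_state("http://149.165.171.33:5050"))`, assuming that address is the leading master.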

 

I am trying to get this working, and any help or clues would be much appreciated.

 

Thanks and Regards,

Mangirish Wagle



 

On Fri, Sep 30, 2016 at 9:21 PM, Mangirish Wagle <vaglomangirish@gmail.com> wrote:

Hello Devs, 

 

I am currently running a sample MPI C program using 'mpiexec' provided by MPICH. I followed
their installation guide to install the libraries on the master and slave nodes of the mesos
cluster.

 

The approach that I am trying out here is to equip the underlying nodes with MPI handling
tools and then use a Mesos framework like Marathon/Aurora to submit jobs that run MPI
programs by invoking these tools.
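
As a rough illustration of this approach, the sketch below builds a Marathon app definition whose command invokes mpiexec on a node that already has MPICH installed. It is a hedged example, not the configuration used in this thread: the working directory `/home/centos`, the resource values, and the helper names `build_app` and `submit_app` are assumptions, while the app id `/maw-try` and the Marathon address are taken from the URL quoted elsewhere in the thread. The payload is posted to Marathon's `/v2/apps` endpoint.

```python
import json
import urllib.request
from typing import Any, Dict


def build_app(app_id: str, cmd: str, cpus: float = 0.5, mem: float = 64.0) -> Dict[str, Any]:
    """Build a minimal Marathon app definition (resource values are assumptions)."""
    return {"id": app_id, "cmd": cmd, "cpus": cpus, "mem": mem, "instances": 1}


def submit_app(marathon_url: str, app: Dict[str, Any]) -> int:
    """POST the app definition to Marathon and return the HTTP status code."""
    req = urllib.request.Request(
        marathon_url.rstrip("/") + "/v2/apps",
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


# Hypothetical command: assumes machinefile and mpitest live in /home/centos.
app = build_app("/maw-try", "cd /home/centos && mpiexec -f machinefile -n 2 ./mpitest")
print(json.dumps(app, indent=2))
```

Submitting would then be `submit_app("http://149.165.170.245:8080", app)`. One caveat with this design: Marathon is built for long-running services and restarts a task once its command exits, so a one-shot MPI run would likely be relaunched repeatedly unless the app is removed or scaled to zero afterwards.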

 

You can potentially run an MPI program using mpiexec in the following manner:-

 

# mpiexec -f machinefile -n 2 ./mpitest

machinefile -> File containing an inventory of machines to run the program on and the number
of processes on each machine.

mpitest -> MPI program written in C and compiled with the mpicc compiler. The program reports
the process number and the hostname of the machine running the process.

-n -> Number of processes that mpiexec needs to spawn.

Example of machinefile contents:-

 

# Entries in the format <hostname/IP>:<number of processes>

mesos-slave-1:1

mesos-slave-2:1
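
The machinefile format above (`<hostname/IP>:<number of processes>`) is simple enough to generate and check from a script before handing it to mpiexec. The following is a minimal sketch in Python under stated assumptions: the helper names are hypothetical, a bare hostname without a count is treated as one process, and IPv6 addresses are not handled.

```python
import shlex
from typing import Dict, List


def render_machinefile(hosts: Dict[str, int]) -> str:
    """Render hosts as '<hostname/IP>:<number of processes>' lines."""
    return "".join(f"{host}:{count}\n" for host, count in hosts.items())


def parse_machinefile(text: str) -> Dict[str, int]:
    """Parse machinefile text, skipping blank lines and '#' comments."""
    hosts: Dict[str, int] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        host, sep, count = line.partition(":")
        # Assumption of this sketch: no count means one process on that host.
        hosts[host] = int(count) if sep else 1
    return hosts


def mpiexec_command(machinefile: str, program: str, hosts: Dict[str, int]) -> List[str]:
    """Build the mpiexec argument list, using every slot listed in the machinefile."""
    total = sum(hosts.values())
    return ["mpiexec", "-f", machinefile, "-n", str(total), program]


text = "# Entries in the format <hostname/IP>:<number of processes>\n\nmesos-slave-1:1\n\nmesos-slave-2:1\n"
hosts = parse_machinefile(text)
print(shlex.join(mpiexec_command("machinefile", "./mpitest", hosts)))
# prints: mpiexec -f machinefile -n 2 ./mpitest
```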

 

The reason for choosing slaves is that Mesos runs the jobs on slaves, managed by 'agents'
pertaining to the slaves.

 

Output of the program with '-n 1':-

 

# mpiexec -f machinefile -n 1 ./mpitest

Hello world!  I am process number: 0 on host mesos-slave-1

 

But when I try for '-n 2', I am hitting the following error:-

 

# mpiexec -f machinefile -n 2 ./mpitest

[proxy:0:1@mesos-slave-2] HYDU_sock_connect (/home/centos/mpich-3.2/src/pm/hydra/utils/sock/sock.c:172):
unable to connect from "mesos-slave-2" to "mesos-slave-1" (No route to host)

[proxy:0:1@mesos-slave-2] main (/home/centos/mpich-3.2/src/pm/hydra/pm/pmiserv/pmip.c:189):
unable to connect to server mesos-slave-1 at port 44788 (check for firewalls!)

 

The program execution seems to fail because network traffic is being blocked. I checked the
security groups in scigap OpenStack for the mesos-slave-1 and mesos-slave-2 nodes, and they are
set to the 'wideopen' policy. Furthermore, I tried adding explicit rules to the policies to allow
all TCP and UDP traffic (I am currently not sure which protocol is used underneath), but the
error persists.
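
The Hydra messages above come from a socket connect to a specific port, which suggests a plain TCP connection between the nodes. A quick way to separate a network problem from an MPI problem is to try the same kind of connection by hand. The sketch below is a minimal Python probe, offered as an assumption-laden aid rather than a definitive test; the function name `check_tcp` is hypothetical. It distinguishes an open port, a refused connection, a timeout, and the "No route to host" condition, which can also be produced by a host-level firewall rule and not only by cloud security groups.

```python
import errno
import socket


def check_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connection and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        # Host reachable, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            # Same condition as the "No route to host" text in the Hydra error.
            return "no route to host"
        return f"error: {exc}"
```

To use it, start any temporary TCP listener on mesos-slave-1 on a port of your choosing and call `check_tcp("mesos-slave-1", <that port>)` from mesos-slave-2. The port 44788 in the error appears to be chosen per run, so probing it after the fact is unlikely to be meaningful. A result of "refused" would indicate that packets are getting through, while "no route to host" or "timeout" points at filtering somewhere on the path.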

 

Any clues, suggestions, comments about the error or approach as a whole would be helpful.

 

Thanks and Regards,

Mangirish Wagle

 


 

On Tue, Sep 27, 2016 at 11:23 AM, Mangirish Wagle <vaglomangirish@gmail.com> wrote:

Hello Devs, 

 

Thanks Gourav and Shameera for all the work w.r.t. setting up the Mesos-Marathon cluster on
Jetstream.

 

I am currently evaluating MPICH (http://www.mpich.org/about/overview/) for launching MPI jobs
on top of Mesos. MPICH version 1.2 supports Mesos based MPI scheduling. I have also been
trying to submit jobs to the cluster through Marathon. However, in either case I am currently
facing issues, which I am working to resolve.

 

I am compiling my notes into the following Google doc. Please review it and let me know
your comments and suggestions.

 

https://docs.google.com/document/d/1p_Y4Zd4I4lgt264IHspXJli3la25y6bcPcmrTD6nR8g/edit?usp=sharing

 

Thanks and Regards,

Mangirish Wagle




 

On Wed, Sep 21, 2016 at 3:20 PM, Shenoy, Gourav Ganesh <goshenoy@indiana.edu> wrote:

Hi Mangirish,

 

I have set up a Mesos-Marathon cluster for you on Jetstream. I will share the cluster details
with you in a separate email. Kindly note that there are 3 masters & 2 slaves in
this cluster. 

 

I am also working on automating this process for Jetstream (similar to Shameera’s ansible
script for EC2) and when that is ready, we can create clusters or add/remove slave machines
from the cluster.

 

Thanks and Regards,

Gourav Shenoy

 

From: Mangirish Wagle <vaglomangirish@gmail.com>
Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Date: Wednesday, September 21, 2016 at 2:36 PM
To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Subject: Running MPI jobs on Mesos based clusters

 

Hello All, 

 

I would like to post for everybody's awareness about the study that I am undertaking this
fall, i.e. to evaluate various different frameworks that would facilitate MPI jobs on Mesos
based clusters for Apache Airavata.

 

Some of the options that I am looking at are:-

MPI support framework bundled with Mesos
Apache Aurora
Marathon
Chronos

Some of the evaluation criteria that I am planning to base my investigation on are:-

Ease of setup
Documentation
Reliability features like HA
Scaling and Fault recovery
Performance
Community Support

Gourav and Shameera are working on ansible based automation to spin up a Mesos based cluster,
and I am planning to use it to set up a cluster for experimentation.

 

Any suggestions or information about prior work on this would be highly appreciated.

 

Thank you.

 

Best Regards,

Mangirish Wagle
