airavata-dev mailing list archives

From Mangirish Wagle <vaglomangir...@gmail.com>
Subject Re: Running MPI jobs on Mesos based clusters
Date Sat, 01 Oct 2016 01:21:51 GMT
Hello Devs,

I am currently running a sample MPI C program using 'mpiexec' provided by
MPICH. I followed their installation guide
<http://www.mpich.org/static/downloads/3.2/mpich-3.2-installguide.pdf> to
install the libraries on the master and slave nodes of the mesos cluster.

The approach I am trying out is to equip the underlying nodes with MPI
handling tools and then use a Mesos framework such as Marathon or Aurora to
submit jobs that run MPI programs by invoking these tools.

You can potentially run an MPI program using mpiexec in the following
manner:-

# *mpiexec -f machinefile -n 2 ./mpitest*

   - *machinefile* -> File containing an inventory of the machines to run
   the program on and the number of processes on each machine.
   - *mpitest* -> MPI program compiled from C with the mpicc compiler. The
   program prints the process number and the hostname of the machine running
   the process.
   - *-n* -> Number of processes to spawn.

Example of machinefile contents:-

# Entries in the format <hostname/IP>:<number of processes>
mesos-slave-1:1
mesos-slave-2:1
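
The machinefile format above can also be read programmatically. The following
is a minimal Python sketch, not part of MPICH: the helper names
parse_machinefile and assign_ranks are hypothetical, and the placement rule
(fill each host's slots in file order, wrapping around when more processes are
requested than there are slots) is an assumption about the process manager's
behaviour, not something confirmed in this thread.

```python
from itertools import cycle


def parse_machinefile(text):
    """Parse '<hostname/IP>:<number of processes>' entries.

    Comments (starting with '#') and blank lines are skipped.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        host, sep, count = line.partition(":")
        # Assumption for this sketch: a bare hostname counts as one process.
        entries.append((host.strip(), int(count) if sep else 1))
    return entries


def assign_ranks(entries, nprocs):
    """Assumed placement: fill slots in file order, wrapping around if needed."""
    slots = [host for host, count in entries for _ in range(count)]
    if not slots:
        raise ValueError("machinefile has no usable entries")
    return list(zip(range(nprocs), cycle(slots)))


MACHINEFILE = """\
# Entries in the format <hostname/IP>:<number of processes>
mesos-slave-1:1
mesos-slave-2:1
"""

entries = parse_machinefile(MACHINEFILE)
for rank, host in assign_ranks(entries, 2):
    print(f"rank {rank} -> {host}")
```

Under that assumed placement rule, '-n 1' would keep the single process on
mesos-slave-1, while '-n 2' would also place a process on mesos-slave-2.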

I chose the slave nodes because Mesos runs jobs on the slaves, where they
are managed by the 'agent' running on each slave.

Output of the program with '-n 1':-

# mpiexec -f machinefile -n 1 ./mpitest
Hello world!  I am process number: 0 on host mesos-slave-1

But when I try '-n 2', I hit the following error:-

# mpiexec -f machinefile -n 2 ./mpitest
[proxy:0:1@mesos-slave-2] HYDU_sock_connect
(/home/centos/mpich-3.2/src/pm/hydra/utils/sock/sock.c:172): unable to
connect from "mesos-slave-2" to "mesos-slave-1" (No route to host)
[proxy:0:1@mesos-slave-2] main
(/home/centos/mpich-3.2/src/pm/hydra/pm/pmiserv/pmip.c:189): *unable to
connect to server mesos-slave-1 at port 44788* (check for firewalls!)

The program seems to be failing because network traffic between the nodes is
being blocked. I checked the security groups in the SciGaP OpenStack for the
mesos-slave-1 and mesos-slave-2 nodes, and both are set to the 'wideopen'
policy. Furthermore, I tried adding explicit rules to the policies to allow
all TCP and UDP traffic (I am currently not sure which protocol is used
underneath), but the error persists.
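
To check whether the connection failure is independent of MPICH, a plain TCP
probe can be run between the two nodes. The following is a minimal Python
sketch using only the standard socket module; probe_tcp is a hypothetical
helper, and the demonstration uses a local listener on 127.0.0.1 as a stand-in
for a real listener on mesos-slave-1. The port in the error above is assumed
to change from run to run, so a probe against a real node would need a
listener started there on a known port first.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP connection and return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No answer at all within the timeout.
        return False, "timed out"
    except OSError as exc:
        # Report the symbolic errno (e.g. ECONNREFUSED, EHOSTUNREACH).
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, f"{name}: {exc.strerror}"


# Local demonstration: listen on a port chosen by the OS, probe it,
# then close the listener and probe the same port again.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
demo_port = server.getsockname()[1]

print(probe_tcp("127.0.0.1", demo_port))
server.close()
print(probe_tcp("127.0.0.1", demo_port))
```

Run from mesos-slave-2 against a listener on mesos-slave-1, an EHOSTUNREACH
result would correspond to the "No route to host" text in the error above and
would suggest the block sits below MPICH, somewhere in the network path.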

Any clues, suggestions, comments about the error or approach as a whole
would be helpful.

Thanks and Regards,
Mangirish Wagle


On Tue, Sep 27, 2016 at 11:23 AM, Mangirish Wagle <vaglomangirish@gmail.com>
wrote:

> Hello Devs,
>
> Thanks Gourav and Shameera for all the work w.r.t. setting up the
> Mesos-Marathon cluster on Jetstream.
>
> I am currently evaluating MPICH (http://www.mpich.org/about/overview/) to
> be used for launching MPI jobs on top of mesos. MPICH version 1.2 supports
> Mesos based MPI scheduling. I have also been trying to submit jobs to the
> cluster through Marathon. However, in either case I am currently facing
> issues which I am working to resolve.
>
> I am compiling my notes into the following Google doc. Please review it
> and let me know your comments and suggestions.
>
> https://docs.google.com/document/d/1p_Y4Zd4I4lgt264IHspXJli3la25y6bcPcmrTD6nR8g/edit?usp=sharing
>
> Thanks and Regards,
> Mangirish Wagle
>
>
>
> On Wed, Sep 21, 2016 at 3:20 PM, Shenoy, Gourav Ganesh <
> goshenoy@indiana.edu> wrote:
>
>> Hi Mangirish,
>>
>>
>>
>> I have set up a Mesos-Marathon cluster for you on Jetstream. I will share
>> the cluster details with you in a separate email. Kindly note that
>> there are 3 masters & 2 slaves in this cluster.
>>
>>
>>
>> I am also working on automating this process for Jetstream (similar to
>> Shameera’s ansible script for EC2) and when that is ready, we can create
>> clusters or add/remove slave machines from the cluster.
>>
>>
>>
>> Thanks and Regards,
>>
>> Gourav Shenoy
>>
>>
>>
>> *From: *Mangirish Wagle <vaglomangirish@gmail.com>
>> *Reply-To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
>> *Date: *Wednesday, September 21, 2016 at 2:36 PM
>> *To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
>> *Subject: *Running MPI jobs on Mesos based clusters
>>
>>
>>
>> Hello All,
>>
>>
>>
>> I would like to make everybody aware of the study that I am undertaking
>> this fall: evaluating different frameworks that would facilitate MPI jobs
>> on Mesos based clusters for Apache Airavata.
>>
>>
>>
>> Some of the options that I am looking at are:-
>>
>>    1. MPI support framework bundled with Mesos
>>    2. Apache Aurora
>>    3. Marathon
>>    4. Chronos
>>
>> Some of the evaluation criteria on which I am planning to base my
>> investigation are:-
>>
>>    - Ease of setup
>>    - Documentation
>>    - Reliability features like HA
>>    - Scaling and Fault recovery
>>    - Performance
>>    - Community Support
>>
>> Gourav and Shameera are working on ansible based automation to spin up a
>> mesos based cluster and I am planning to use it to set up a cluster for
>> experimentation.
>>
>>
>>
>> Any suggestions or information about prior work on this would be highly
>> appreciated.
>>
>>
>>
>> Thank you.
>>
>>
>>
>> Best Regards,
>>
>> Mangirish Wagle
>>
>>
>
