airavata-dev mailing list archives

From Mangirish Wagle <vaglomangir...@gmail.com>
Subject Re: Running MPI jobs on Mesos based clusters
Date Thu, 13 Oct 2016 16:39:12 GMT
Hi Marlon,
Thanks for confirming and sharing the legal link.

-Mangirish

On Thu, Oct 13, 2016 at 12:13 PM, Pierce, Marlon <marpierc@iu.edu> wrote:

> BSD is ok: https://www.apache.org/legal/resolved.
>
>
>
> From: Mangirish Wagle <vaglomangirish@gmail.com>
> Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Date: Thursday, October 13, 2016 at 12:03 PM
> To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Subject: Re: Running MPI jobs on Mesos based clusters
>
>
>
> Hello Devs,
>
> I needed some advice on the licensing of the MPI libraries. The MPICH
> library that I have been trying out claims to have a "BSD-like" license
> (http://git.mpich.org/mpich.git/blob/HEAD:/COPYRIGHT).
>
> I am aware that Open MPI, which uses a BSD license, is currently used in
> our application. I chose to start investigating MPICH because it claims to
> be a highly portable, high-quality implementation of the latest MPI
> standard, suitable for cloud-based clusters.
>
> If anyone could advise on whether the MPICH library's BSD-like license is
> acceptable for the ASF, that would help.
>
> Thank you.
>
> Best Regards,
>
> Mangirish Wagle
>
>
>
> On Thu, Oct 6, 2016 at 1:48 AM, Mangirish Wagle <vaglomangirish@gmail.com>
> wrote:
>
> Hello Devs,
>
>
>
> The network issue mentioned above now stands resolved. The problem was
> that iptables had some conflicting rules which blocked the traffic; it was
> resolved by a simple iptables flush.
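>
> For the record, the flush amounted to something like the following on each
> node (a sketch; the exact rules and commands may differ per setup):
>
> # iptables -F     (flush all filter-table rules)
> # iptables -L -n  (verify that the conflicting rules are gone)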
>
>
>
> Here is the test MPI program running on multiple machines:-
>
>
>
> [centos@mesos-slave-1 ~]$ mpiexec -f machinefile -n 2 ./mpitest
>
> Hello world!  I am process number: 0 on host mesos-slave-1
>
> Hello world!  I am process number: 1 on host mesos-slave-2
>
>
>
> The next step is to try invoking this through a framework like Marathon.
> However, the job submission still does not run through Marathon; it seems
> to get stuck in the 'waiting' state forever (for example,
> http://149.165.170.245:8080/ui/#/apps/%2Fmaw-try). Further, I notice that
> Marathon is listed under 'inactive frameworks' in the Mesos dashboard (
> http://149.165.171.33:5050/#/frameworks).
>
>
>
> I am still trying to get this working, though any help or clues would be
> really appreciated.
>
>
>
> Thanks and Regards,
>
> Mangirish Wagle
>
>
>
>
> On Fri, Sep 30, 2016 at 9:21 PM, Mangirish Wagle <vaglomangirish@gmail.com>
> wrote:
>
> Hello Devs,
>
>
>
> I am currently running a sample MPI C program using 'mpiexec' provided by
> MPICH. I followed their installation guide
> (http://www.mpich.org/static/downloads/3.2/mpich-3.2-installguide.pdf) to
> install the libraries on the master and slave nodes of the Mesos cluster.
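>
> For anyone repeating the setup, the build steps are roughly the standard
> ones from the install guide (the prefix below is just an example, not
> necessarily what we used):
>
> # wget http://www.mpich.org/static/downloads/3.2/mpich-3.2.tar.gz
> # tar xzf mpich-3.2.tar.gz && cd mpich-3.2
> # ./configure --prefix=/usr/local/mpich
> # make && make install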
>
>
>
> The approach I am trying out here is to equip the underlying nodes with
> MPI tooling and then use a Mesos framework like Marathon or Aurora to
> submit jobs that run MPI programs by invoking these tools.
>
>
>
> You can potentially run an MPI program using mpiexec in the following
> manner:-
>
>
>
> # mpiexec -f machinefile -n 2 ./mpitest
>
>    - machinefile -> a file containing an inventory of the machines to run
>    the program on and the number of processes on each machine.
>    - mpitest -> an MPI program compiled in C using the mpicc compiler. The
>    program prints the process number and the hostname of the machine running
>    the process (a minimal sketch of such a program follows the machinefile
>    example below).
>    - -n -> the number of processes to spawn.
>
> Example of machinefile contents:-
>
>
>
> # Entries in the format <hostname/IP>:<number of processes>
>
> mesos-slave-1:1
>
> mesos-slave-2:1
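>
> For reference, here is a minimal sketch of what the mpitest program could
> look like, reconstructed from the output above (my assumption, not
> necessarily the exact source):
>
> /* mpitest.c - prints each process's rank and the host it runs on */
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char *argv[]) {
>     int rank, len;
>     char host[MPI_MAX_PROCESSOR_NAME];
>     MPI_Init(&argc, &argv);                /* start the MPI runtime */
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank = process number */
>     MPI_Get_processor_name(host, &len);    /* hostname of this node */
>     printf("Hello world!  I am process number: %d on host %s\n", rank, host);
>     MPI_Finalize();                        /* shut down the MPI runtime */
>     return 0;
> }
>
> Compiled with: # mpicc -o mpitest mpitest.c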
>
>
>
> The reason for choosing slaves is that Mesos runs jobs on the slave nodes,
> managed by the 'agents' running on them.
>
>
>
> Output of the program with '-n 1':-
>
>
>
> # mpiexec -f machinefile -n 1 ./mpitest
>
> Hello world!  I am process number: 0 on host mesos-slave-1
>
>
>
> But when I try '-n 2', I hit the following error:-
>
>
>
> # mpiexec -f machinefile -n 2 ./mpitest
>
> [proxy:0:1@mesos-slave-2] HYDU_sock_connect (/home/centos/mpich-3.2/src/pm/hydra/utils/sock/sock.c:172): unable to connect from "mesos-slave-2" to "mesos-slave-1" (No route to host)
>
> [proxy:0:1@mesos-slave-2] main (/home/centos/mpich-3.2/src/pm/hydra/pm/pmiserv/pmip.c:189): unable to connect to server mesos-slave-1 at port 44788 (check for firewalls!)
>
>
>
> It seems the program execution fails because the network traffic is being
> blocked. I checked the security groups in the SciGaP OpenStack for the
> mesos-slave-1 and mesos-slave-2 nodes, and they are set to the 'wideopen'
> policy. Furthermore, I tried adding explicit rules to the policies to allow
> all TCP and UDP traffic (currently I am not sure which protocol is used
> underneath), yet it continues throwing this error.
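>
> For reference, a quick way to narrow this down from mesos-slave-2 would be
> something like the following (the port is taken from the error message
> above):
>
> # ping mesos-slave-1           (basic reachability between the slaves)
> # nc -zv mesos-slave-1 44788   (TCP connect test to the reported port)
> # iptables -L -n               (inspect the local firewall rules on each node)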
>
>
>
> Any clues, suggestions, comments about the error or approach as a whole
> would be helpful.
>
>
>
> Thanks and Regards,
>
> Mangirish Wagle
>
>
>
>
>
>
> On Tue, Sep 27, 2016 at 11:23 AM, Mangirish Wagle <
> vaglomangirish@gmail.com> wrote:
>
> Hello Devs,
>
>
>
> Thanks Gourav and Shameera for all the work w.r.t. setting up the
> Mesos-Marathon cluster on Jetstream.
>
>
>
> I am currently evaluating MPICH (http://www.mpich.org/about/overview/) for
> launching MPI jobs on top of Mesos. MPICH version 1.2 supports Mesos-based
> MPI scheduling. I have also been trying to submit jobs to the cluster
> through Marathon. However, in both cases I am currently facing issues which
> I am working to resolve.
>
>
>
> I am compiling my notes into the following Google doc. Please review it
> and let me know your comments and suggestions.
>
>
>
> https://docs.google.com/document/d/1p_Y4Zd4I4lgt264IHspXJli3la25y6bcPcmrTD6nR8g/edit?usp=sharing
>
>
>
> Thanks and Regards,
>
> Mangirish Wagle
>
>
>
>
>
>
> On Wed, Sep 21, 2016 at 3:20 PM, Shenoy, Gourav Ganesh <
> goshenoy@indiana.edu> wrote:
>
> Hi Mangirish,
>
>
>
> I have set up a Mesos-Marathon cluster for you on Jetstream. I will share
> the cluster details with you in a separate email. Kindly note that there
> are 3 masters & 2 slaves in this cluster.
>
>
>
> I am also working on automating this process for Jetstream (similar to
> Shameera’s ansible script for EC2) and when that is ready, we can create
> clusters or add/remove slave machines from the cluster.
>
>
>
> Thanks and Regards,
>
> Gourav Shenoy
>
>
>
> From: Mangirish Wagle <vaglomangirish@gmail.com>
> Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Date: Wednesday, September 21, 2016 at 2:36 PM
> To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Subject: Running MPI jobs on Mesos based clusters
>
>
>
> Hello All,
>
>
>
> I would like to make everybody aware of the study I am undertaking this
> fall: evaluating different frameworks that would facilitate MPI jobs on
> Mesos-based clusters for Apache Airavata.
>
>
>
> Some of the options that I am looking at are:-
>
>    1. MPI support framework bundled with Mesos
>    2. Apache Aurora
>    3. Marathon
>    4. Chronos
>
> Some of the evaluation criteria on which I am planning to base my
> investigation are:-
>
>    - Ease of setup
>    - Documentation
>    - Reliability features like HA
>    - Scaling and Fault recovery
>    - Performance
>    - Community Support
>
> Gourav and Shameera are working on Ansible-based automation to spin up a
> Mesos-based cluster, and I am planning to use it to set up a cluster for
> experimentation.
>
>
>
> Any suggestions or information about prior work on this would be highly
> appreciated.
>
>
>
> Thank you.
>
>
>
> Best Regards,
>
> Mangirish Wagle
>
