From: Mangirish Wagle <vaglomangirish@gmail.com>
Date: Thu, 13 Oct 2016 12:39:12 -0400
To: dev@airavata.apache.org
Subject: Re: Running MPI jobs on Mesos based clusters

Hi Marlon,

Thanks for confirming and sharing the legal link.

-Mangirish

On Thu, Oct 13, 2016 at 12:13 PM, Pierce, Marlon <marpierc@iu.edu> wrote:

> BSD is ok: https://www.apache.org/legal/resolved.
>
> *From: *Mangirish Wagle <vaglomangirish@gmail.com>
> *Reply-To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
> *Date: *Thursday, October 13, 2016 at 12:03 PM
> *To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
> *Subject: *Re: Running MPI jobs on Mesos based clusters
>
> Hello Devs,
>
> I need some advice on the license of the MPI libraries. The MPICH
> library that I have been trying out claims to have a "BSD-like" license
> (http://git.mpich.org/mpich.git/blob/HEAD:/COPYRIGHT).
>
> I am aware that OpenMPI, which uses a BSD license, is currently used in
> our application. I chose to start investigating MPICH because it claims
> to be a highly portable, high-quality implementation of the latest MPI
> standard, well suited to cloud-based clusters.
>
> If anyone could please advise on whether the MPICH library's BSD-like
> license is acceptable for the ASF, that would help.
>
> Thank you.
>
> Best Regards,
>
> Mangirish Wagle
>
> On Thu, Oct 6, 2016 at 1:48 AM, Mangirish Wagle
> <vaglomangirish@gmail.com> wrote:
>
> Hello Devs,
>
> The network issue mentioned above now stands resolved. The problem was
> that iptables had some conflicting rules which blocked the traffic. It
> was resolved by a simple iptables flush.
>
> Here is the test MPI program running on multiple machines:
>
> [centos@mesos-slave-1 ~]$ mpiexec -f machinefile -n 2 ./mpitest
> Hello world! I am process number: 0 on host mesos-slave-1
> Hello world! I am process number: 1 on host mesos-slave-2
>
> The next step is to try invoking this through a framework like
> Marathon. However, the job submission still does not run through
> Marathon; it seems to get stuck in the 'waiting' state forever (for
> example http://149.165.170.245:8080/ui/#/apps/%2Fmaw-try). Further, I
> notice that Marathon is listed under 'inactive frameworks' in the Mesos
> dashboard (http://149.165.171.33:5050/#/frameworks).
>
> I am trying to get this working, though any help or clues with this
> would be really helpful.
>
> Thanks and Regards,
>
> Mangirish Wagle
>
> On Fri, Sep 30, 2016 at 9:21 PM, Mangirish Wagle
> <vaglomangirish@gmail.com> wrote:
>
> Hello Devs,
>
> I am currently running a sample MPI C program using 'mpiexec' provided
> by MPICH. I followed their installation guide to install the libraries
> on the master and slave nodes of the Mesos cluster.
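>
> For reference, here is a rough sketch of that per-node setup and of what
> such a test program could look like. This is illustrative only: the
> exact MPICH build options, install prefix, and mpitest source are not
> shown in this thread, and the source directory is assumed from the
> /home/centos/mpich-3.2 paths that appear in the error output further
> down.
>
> # Build and install MPICH from source on every node (master and slaves):
> cd /home/centos/mpich-3.2
> ./configure --prefix=/usr/local/mpich
> make && sudo make install
> export PATH=/usr/local/mpich/bin:$PATH   # so mpicc/mpiexec are found
>
> # A minimal mpitest.c that prints the MPI rank and the hostname,
> # compiled with the mpicc wrapper:
> cat > mpitest.c <<'EOF'
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
>     int rank, len;
>     char host[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Get_processor_name(host, &len);
>     printf("Hello world! I am process number: %d on host %s\n", rank, host);
>     MPI_Finalize();
>     return 0;
> }
> EOF
> mpicc -o mpitest mpitest.c
>
> With the same binary and machinefile available on each node, mpiexec can
> then launch it across the slaves, as shown below.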
>
> The approach that I am trying out here is to equip the underlying nodes
> with MPI handling tools and then use a Mesos framework like
> Marathon/Aurora to submit jobs that run MPI programs by invoking these
> tools.
>
> You can potentially run an MPI program using mpiexec in the following
> manner:
>
> # *mpiexec -f machinefile -n 2 ./mpitest*
>
> - *machinefile* -> file containing an inventory of machines to run the
>   program on and the number of processes on each machine.
> - *mpitest* -> MPI program compiled in C using the mpicc compiler. The
>   program returns the process number and the hostname of the machine
>   running the process.
> - *-n* -> the number of processes to spawn.
>
> Example of machinefile contents:
>
> # Entries in the format <hostname/IP>:<number of processes>
> mesos-slave-1:1
> mesos-slave-2:1
>
> The reason for choosing slaves is that Mesos runs the jobs on slaves,
> managed by 'agents' pertaining to the slaves.
>
> Output of the program with '-n 1':
>
> # mpiexec -f machinefile -n 1 ./mpitest
> Hello world! I am process number: 0 on host mesos-slave-1
>
> But when I try '-n 2', I am hitting the following error:
>
> # mpiexec -f machinefile -n 2 ./mpitest
> [proxy:0:1@mesos-slave-2] HYDU_sock_connect
> (/home/centos/mpich-3.2/src/pm/hydra/utils/sock/sock.c:172): unable to
> connect from "mesos-slave-2" to "mesos-slave-1" (No route to host)
> [proxy:0:1@mesos-slave-2] main
> (/home/centos/mpich-3.2/src/pm/hydra/pm/pmiserv/pmip.c:189): *unable to
> connect to server mesos-slave-1 at port 44788* (check for firewalls!)
>
> It seems the execution fails because network traffic between the nodes
> is being blocked. I checked the security groups in the SciGaP OpenStack
> for the mesos-slave-1 and mesos-slave-2 nodes and they are set to the
> 'wideopen' policy. Furthermore, I tried adding explicit rules to the
> policies to allow all TCP and UDP (currently I am not sure what protocol
> is used underneath), and even then it continues to throw this error.
>
> Any clues, suggestions, or comments about the error or the approach as a
> whole would be helpful.
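>
> (As the follow-up further up in this thread notes, the root cause turned
> out to be conflicting iptables rules on the nodes rather than the
> OpenStack security groups. The following is only a sketch of the kind of
> check and flush that cleared it; the exact rules involved are not shown
> here.)
>
> # On each slave, list the filter rules to spot REJECT/DROP entries:
> sudo iptables -L -n --line-numbers
> # Flushing the conflicting rules resolved the connectivity error:
> sudo iptables -F
> # A narrower fix would be to accept traffic from the peer node only,
> # instead of flushing everything, e.g.:
> sudo iptables -I INPUT -s mesos-slave-1 -j ACCEPT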
>
> Thanks and Regards,
>
> Mangirish Wagle
>
> On Tue, Sep 27, 2016 at 11:23 AM, Mangirish Wagle
> <vaglomangirish@gmail.com> wrote:
>
> Hello Devs,
>
> Thanks Gourav and Shameera for all the work w.r.t. setting up the
> Mesos-Marathon cluster on Jetstream.
>
> I am currently evaluating MPICH (http://www.mpich.org/about/overview/)
> to be used for launching MPI jobs on top of Mesos. MPICH version 1.2
> supports Mesos-based MPI scheduling. I have also been trying to submit
> jobs to the cluster through Marathon. However, in either case I am
> currently facing issues which I am working to get resolved.
>
> I am compiling my notes into the following Google doc. Please review and
> let me know your comments and suggestions.
>
> https://docs.google.com/document/d/1p_Y4Zd4I4lgt264IHspXJli3la25y6bcPcmrTD6nR8g/edit?usp=sharing
>
> Thanks and Regards,
>
> Mangirish Wagle
>
> On Wed, Sep 21, 2016 at 3:20 PM, Shenoy, Gourav Ganesh
> <goshenoy@indiana.edu> wrote:
>
> Hi Mangirish,
>
> I have set up a Mesos-Marathon cluster for you on Jetstream. I will
> share the cluster details with you in a separate email. Kindly note that
> there are 3 masters and 2 slaves in this cluster.
>
> I am also working on automating this process for Jetstream (similar to
> Shameera's ansible script for EC2) and when that is ready, we can create
> clusters or add/remove slave machines from the cluster.
>
> Thanks and Regards,
>
> Gourav Shenoy
>
> *From: *Mangirish Wagle <vaglomangirish@gmail.com>
> *Reply-To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
> *Date: *Wednesday, September 21, 2016 at 2:36 PM
> *To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
> *Subject: *Running MPI jobs on Mesos based clusters
>
> Hello All,
>
> I would like to post for everybody's awareness about the study that I am
> undertaking this fall, i.e. to evaluate various frameworks that would
> facilitate MPI jobs on Mesos-based clusters for Apache Airavata.
>
> Some of the options that I am looking at are:
>
> 1. MPI support framework bundled with Mesos
> 2. Apache Aurora
> 3. Marathon
> 4. Chronos
>
> Some of the evaluation criteria that I am planning to base my
> investigation on are:
>
> - Ease of setup
> - Documentation
> - Reliability features like HA
> - Scaling and fault recovery
> - Performance
> - Community support
>
> Gourav and Shameera are working on Ansible-based automation to spin up a
> Mesos-based cluster and I am planning to use it to set up a cluster for
> experimentation.
>
> Any suggestions or information about prior work on this would be highly
> appreciated.
>
> Thank you.
>
> Best Regards,
>
> Mangirish Wagle