airavata-dev mailing list archives

From "Shenoy, Gourav Ganesh" <goshe...@indiana.edu>
Subject Re: Spinup Mesos-Marathon Cluster for Hybrid Scheduling
Date Sat, 24 Sep 2016 17:01:13 GMT
Hi dev,

I have implemented an Ansible script for provisioning OpenStack instances, which can serve as
nodes for running Mesos/Marathon. This is in addition to Shameera’s script,
which provisions EC2 instances & deploys a Mesos-Marathon cluster on the nodes.

I have created a pull-request, but these changes have been added to my earlier pull-request
for CloudBridge EC2 provisioning (cloud module), since it wasn’t merged. @Shameera, kindly
review the PR and provide feedback/comments if any changes are needed.

Thanks and Regards,
Gourav Shenoy

From: Shameera Rathnayaka <shameerainfo@gmail.com>
Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Date: Friday, September 23, 2016 at 10:03 PM
To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Subject: Re: Spinup Mesos-Marathon Cluster for Hybrid Scheduling

Hi Devs,

As there were no objections, I imported my local mesos ansible script from https://github.com/shamrath/mesos-deployment,
which is already under the Apache 2.0 License, into Apache Airavata, preserving all history.

Thanks,
Shameera.

On Wed, Sep 21, 2016 at 7:13 PM Shameera Rathnayaka <shameerainfo@gmail.com> wrote:
Hi Gourav,

This is a known issue. I have already mentioned the above workaround in the project README file;
see below:


1.    Set valid AWS credentials in roles/ec2/vars/aws-credential.yml. If it doesn't work, add the
following to the ec2 task in roles/ec2/tasks/main.yml:

aws_access_key: <your_valid_access_key>

aws_secret_key: <your_valid_secret_key>

Regards,
Shameera.


On Wed, Sep 21, 2016 at 6:26 PM Shenoy, Gourav Ganesh <goshenoy@indiana.edu> wrote:
Hi dev,

I just hit another problem with the ansible script for mesos-deployment. This issue is related
to creating instances in ec2 using the ansible playbook. The fix is mentioned later below.

In particular, when you run the command (which would spin up 4 machines in EC2):
ansible-playbook -i hosts site.yml -t "ec2"

you might see the below authentication error:

TASK [ec2 : create a aws instace/s] ********************************************
failed: [localhost] (item=gs-mesos-master-1) => {"failed": true, "item": "gs-mesos-master-1",
"msg": "No handler was ready to authenticate. 1 handlers were checked. ['HmacAuthV4Handler']
Check your credentials"}
failed: [localhost] (item=gs-mesos-master-2) => {"failed": true, "item": "gs-mesos-master-2",
"msg": "No handler was ready to authenticate. 1 handlers were checked. ['HmacAuthV4Handler']
Check your credentials"}
failed: [localhost] (item=gs-mesos-master-3) => {"failed": true, "item": "gs-mesos-master-3",
"msg": "No handler was ready to authenticate. 1 handlers were checked. ['HmacAuthV4Handler']
Check your credentials"}
failed: [localhost] (item=gs-mesos-slave-1) => {"failed": true, "item": "gs-mesos-slave-1",
"msg": "No handler was ready to authenticate. 1 handlers were checked. ['HmacAuthV4Handler']
Check your credentials"}

This is because the ansible playbook is not able to authenticate the user, even if you have
updated the “roles/ec2/vars/aws-credential.yml” file with your AWS access & secret
keys.

I was able to resolve this issue by adding the aws_access_key and aws_secret_key lines shown below
to the “roles/ec2/tasks/main.yml” file, which runs the task that creates the EC2 instances.

- name: create a aws instace/s
  ec2:
    aws_access_key: "{{aws_access_key}}"
    aws_secret_key: "{{aws_secret_key}}"
    key_name: "{{ key_name }}"
    region: us-east-1

Basically, this ansible task had no way of knowing the user credentials when it tried to create
the instance(s), hence the error. Hope this helps!
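
A minimal pre-flight sketch in Python of the same idea: fail early when no credentials are available, rather than waiting for the HmacAuthV4Handler error from the playbook. The function name is hypothetical and this check is not part of the playbook. The fallback to the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables (which boto also reads) is an assumption added for illustration.

```python
import os


def resolve_aws_credentials(task_vars, environ=None):
    """Return (access_key, secret_key), or raise if either one is missing.

    task_vars is a dict of the variables that would be passed to the ec2 task.
    Hypothetical helper for illustration; the playbook has no such check.
    """
    environ = os.environ if environ is None else environ
    # Prefer values passed explicitly to the task, as in the fix above.
    access = task_vars.get("aws_access_key") or environ.get("AWS_ACCESS_KEY_ID")
    secret = task_vars.get("aws_secret_key") or environ.get("AWS_SECRET_ACCESS_KEY")
    missing = [
        name
        for name, value in (("aws_access_key", access), ("aws_secret_key", secret))
        if not value
    ]
    if missing:
        raise ValueError("missing AWS credentials: " + ", ".join(missing))
    return access, secret
```

Running a check like this before "ansible-playbook -i hosts site.yml -t "ec2"" would surface an empty or unset key before any EC2 call is attempted.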

@Shameera,
Is this a valid fix? If yes, could you update the ansible script? Thanks in advance.

Thanks and Regards,
Gourav Shenoy

From: Suresh Marru <smarru@apache.org>
Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Date: Friday, September 16, 2016 at 11:02 PM
To: Airavata Dev <dev@airavata.apache.org>
Subject: Re: Spinup Mesos-Marathon Cluster for Hybrid Scheduling

Hi Gourav,

Thank you for this excellent communication. I hope others will follow suit with such mailing
list updates. When you post such a nontrivial diagnosis to the mailing list, others having
trouble will be able to find this thread and follow these steps to fix their problems.

Hoping to see a lot more dev list threads similar to this one.

Suresh

On Sep 16, 2016, at 10:16 PM, Shenoy, Gourav Ganesh <goshenoy@indiana.edu> wrote:

Hi dev,

I finally managed to get the mesos-marathon cluster up & running using the Ansible script.
There were a couple of issues that caused things to fail. I have listed the problems
I faced during installation & the solutions that fixed things for me.

1.  Marathon was not getting installed – This is because Marathon released a new build
(version: 1.3.0-1.0.506.el7) 2 days ago and apparently the RPM for this version is corrupt,
so a plain “yum install marathon” fails. To work around this, I listed all versions of
marathon present in the repository:
# yum --showduplicates list marathon | expand
marathon.x86_64                 1.1.3-1.0.503.el7                    mesosphere
marathon.x86_64                 1.3.0-1.0.506.el7                    mesosphere

The next most recent version was 1.1.3-1.0.503.el7, which seemed stable, so I updated the
ansible task to use this version of marathon.

In “roles/mesos-master/tasks/main.yml” I updated the following:
- name: install mesos and marathon
  yum:
    name: "{{ item }}"
  with_items:
    - mesos
    - marathon-1.1.3-1.0.503.el7

The mesos-marathon cluster installation worked perfectly fine after this change.
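
The version selection above can be sketched in Python: parse the "yum --showduplicates list marathon" output and pick the most recent version that is not known to be broken. The function names are hypothetical, and the sketch assumes yum lists versions oldest first, as in the output above.

```python
def available_versions(listing, package="marathon"):
    """Extract version strings for `package` from `yum --showduplicates list` output."""
    versions = []
    for line in listing.splitlines():
        fields = line.split()
        # Package rows look like: marathon.x86_64  1.1.3-1.0.503.el7  mesosphere
        if len(fields) == 3 and fields[0].rsplit(".", 1)[0] == package:
            versions.append(fields[1])
    return versions


def pinned_package(listing, broken, package="marathon"):
    """Return 'name-version' for the last listed version that is not in `broken`."""
    candidates = [v for v in available_versions(listing, package) if v not in broken]
    if not candidates:
        raise LookupError("no usable version of " + package)
    # Assumption: versions are listed oldest first, so the last one is the newest.
    return package + "-" + candidates[-1]


listing = """\
marathon.x86_64                 1.1.3-1.0.503.el7                    mesosphere
marathon.x86_64                 1.3.0-1.0.506.el7                    mesosphere
"""
# prints marathon-1.1.3-1.0.503.el7, the value used in the task above
print(pinned_package(listing, broken={"1.3.0-1.0.506.el7"}))
```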

2.       Even after this, the command “mesos-resolve `cat /etc/mesos/zk`” was failing
with the error:
Failed to obtain the IP address for 'ip-172-30-1-197'; the DNS service may
not be able to resolve it: Name or service not known

Apparently it couldn’t resolve the hostname of the local master machine. I resolved this
issue by adding a host entry on each master node.
E.g., on the master node that threw the above error, I added this entry to /etc/hosts:
172.30.1.197       ip-172-30-1-197

After this I was able to get the master-ip and visit the mesos dashboard (<master-ip>:5050).
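
A small Python sketch of how such a host entry could be generated for each master, assuming the hostnames follow the ip-a-b-c-d pattern seen in the error above. The function name is hypothetical.

```python
import ipaddress


def hosts_entry(hostname):
    """Build an /etc/hosts line for a private hostname of the form ip-a-b-c-d."""
    prefix, _, dashed = hostname.partition("-")
    if prefix != "ip" or not dashed:
        raise ValueError("not an ip-a-b-c-d hostname: " + hostname)
    # IPv4Address raises ValueError if the remainder is not a valid address.
    address = ipaddress.IPv4Address(dashed.replace("-", "."))
    return "{}\t{}".format(address, hostname)


# prints the address, a tab, then the hostname: 172.30.1.197	ip-172-30-1-197
print(hosts_entry("ip-172-30-1-197"))
```

The resulting line still has to be appended to /etc/hosts on the node itself, for example from an Ansible task.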

3.       I noticed that although I was able to view the mesos dashboard, I couldn’t access
the marathon dashboard. The connection to <master-ip>:8080 was getting refused. I then
restarted the marathon service on the master node – sudo service marathon restart. After
this I was able to access the marathon dashboard as well.
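
To tell a refused connection apart from a browser problem, a quick TCP check can help. This is a minimal Python sketch; the function name and the timeout value are assumptions, and only the ports 5050 (mesos) and 8080 (marathon) come from the steps above.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hostnames.
        return False
```

For example, port_open("<master-ip>", 8080) returning False while port_open("<master-ip>", 5050) returns True would match the situation described in item 3, where mesos was up but marathon needed a restart.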

Hope this helps!

Thanks and Regards,
Gourav Shenoy

From: "Shenoy, Gourav Ganesh" <goshenoy@indiana.edu>
Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Date: Friday, September 16, 2016 at 3:52 PM
To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Subject: Re: Spinup Mesos-Marathon Cluster for Hybrid Scheduling

Hi Shameera,

As discussed, after commenting out the “marathon” section the ansible playbooks execute
without errors. But when I try to get the master-ip using “mesos-resolve”, I get an error:

I SSH’ed into one of the master machines and checked the status of the mesos-master
service. It seems the service is in a failed state. See the trace below:

[centos@ip-172-30-1-39 ~]$ sudo service mesos-master status
Redirecting to /bin/systemctl status  mesos-master.service
● mesos-master.service - Mesos Master
   Loaded: loaded (/usr/lib/systemd/system/mesos-master.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2016-09-16 19:46:37 UTC; 18s ago
  Process: 12608 ExecStart=/usr/bin/mesos-init-wrapper master (code=exited, status=1/FAILURE)
 Main PID: 12608 (code=exited, status=1/FAILURE)

Sep 16 19:46:37 ip-172-30-1-39 systemd[1]: Unit mesos-master.service entered failed state.
Sep 16 19:46:37 ip-172-30-1-39 systemd[1]: mesos-master.service failed.

Hope this helps debugging the problem.

Thanks and Regards,
Gourav Shenoy

From: Suresh Marru <smarru@apache.org>
Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Date: Friday, September 16, 2016 at 9:30 AM
To: Airavata Dev <dev@airavata.apache.org>
Subject: Re: Spinup Mesos-Marathon Cluster for Hybrid Scheduling

Hi Shameera,

All of these are great directions for Airavata. Thank you for pushing the Ansible and Mesos
deployments on the clouds. I think it will be better if we get your scripts into the Airavata
repo and all of us work on them collectively. It looks like at least Pankaj and Gourav will also
be able to contribute in addition to you.

Suresh

On Sep 15, 2016, at 8:59 PM, Shenoy, Gourav Ganesh <goshenoy@indiana.edu> wrote:

Sure, thanks Shameera. I will try this.

Thanks and Regards,
Gourav Shenoy

From: Shameera Rathnayaka <shameerainfo@gmail.com>
Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Date: Thursday, September 15, 2016 at 8:55 PM
To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Subject: Re: Spinup Mesos-Marathon Cluster for Hybrid Scheduling

Interesting, I am also getting the same issue. The same script worked perfectly yesterday.
I suspect some issue with the marathon rpm. With the marathon installation removed, Mesos gets
installed without any issue.

To remove the marathon installation, make the following changes to the /roles/mesos-master/tasks/main.yml file:

1. Comment out marathon in the "install mesos and marathon" task
2. Comment out the last task, which starts marathon

Meanwhile, I will try to find the exact reason.

~ Shameera.

On Thu, Sep 15, 2016 at 8:32 PM Shenoy, Gourav Ganesh <goshenoy@indiana.edu> wrote:
Hi Shameera,

I am using the same image which you used (centos_ami_7_2: ami-6d1c2007).

Thanks and Regards,
Gourav Shenoy

From: Shameera Rathnayaka <shameerainfo@gmail.com>
Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Date: Thursday, September 15, 2016 at 8:26 PM
To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Subject: Re: Spinup Mesos-Marathon Cluster for Hybrid Scheduling

Hi Gourav,

According to the error, something went wrong while unpacking the marathon bundle. See:

  Installing : marathon-1.3.0-1.0.506.el7.x86_64                            1/1
error: unpacking of archive failed on file /usr/bin/marathon;57daffff: cpio: read
  Verifying  : marathon-1.3.0-1.0.506.el7.x86_64                            1/1

Failed:
  marathon.x86_64 0:1.3.0-1.0.506.el7

Which OS image and version did you use to create the instances? I tested with CentOS 7.2 and it works
fine.

~ Shameera.


On Thu, Sep 15, 2016 at 8:14 PM Shenoy, Gourav Ganesh <goshenoy@indiana.edu> wrote:
Hi Shameera,

I am trying to build a mesos cluster on EC2 using your playbooks. But I am facing some issues.
Please find the details below:

Details:
- I created 4 instances on EC2 (us-east-1 region) using the cloud-provisioning module
(CloudBridge python). Out of the 4, 3 were meant to be mesos masters & 1 slave.
Note: The instances' inbound & outbound traffic is wide open.
- I skipped step-1 & step-2 in your README, since I manually provisioned the
instances. Next, I updated the “hosts” file with the public IPs of all 4 instances, and also
updated the “roles/zookeeper/vars/main.yml” file with the private IPs of the 3 master instances.
- I executed the “ansible-playbook -i hosts site.yml -t "mesos-master"” command,
and I got the following error:

TASK [mesos-master : install firewalld] ****************************************
ok: [52.91.152.1]
ok: [52.87.235.79]
ok: [54.167.94.186]

TASK [mesos-master : start firewalld] ******************************************
ok: [52.91.152.1]
ok: [52.87.235.79]
ok: [54.167.94.186]

TASK [mesos-master : open ports] ***********************************************
ok: [52.91.152.1] => (item=5050/tcp)
ok: [52.87.235.79] => (item=5050/tcp)
ok: [54.167.94.186] => (item=5050/tcp)
ok: [52.87.235.79] => (item=8080/tcp)
ok: [54.167.94.186] => (item=8080/tcp)
ok: [52.91.152.1] => (item=8080/tcp)

TASK [mesos-master : install utility - TODO delete this] ***********************
ok: [52.91.152.1] => (item=[u'vim'])
ok: [52.87.235.79] => (item=[u'vim'])
ok: [54.167.94.186] => (item=[u'vim'])

TASK [mesos-master : add mesosphere rpm] ***************************************
ok: [52.91.152.1]
ok: [52.87.235.79]
ok: [54.167.94.186]

TASK [mesos-master : install mesos and marathon] *******************************
failed: [52.91.152.1] (item=[u'mesos', u'marathon']) => {"changed": true, "failed": true,
"item": ["mesos", "marathon"], "msg": "Error unpacking rpm package marathon-1.3.0-1.0.506.el7.x86_64\n",
"rc": 1, "results": ["All packages providing mesos are up to date", "Loaded plugins: fastestmirror\nLoading
mirror speeds from cached hostfile\n * base: mirrors.tripadvisor.com\n
* extras: centos.hostingxtreme.com\n * updates: mirrors.greenmountainaccess.net\nResolving
Dependencies\n--> Running transaction check\n---> Package marathon.x86_64 0:1.3.0-1.0.506.el7
will be installed\n--> Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n
Package         Arch          Version                  Repository         Size\n================================================================================\nInstalling:\n
marathon        x86_64        1.3.0-1.0.506.el7        mesosphere         17 M\n\nTransaction
Summary\n================================================================================\nInstall
 1 Package\n\nTotal download size: 17 M\nInstalled size: 89 M\nDownloading packages:\nRunning
transaction check\nRunning transaction test\nTransaction test succeeded\nRunning transaction\n
 Installing : marathon-1.3.0-1.0.506.el7.x86_64                            1/1 \nerror: unpacking
of archive failed on file /usr/bin/marathon;57daffff: cpio: read\n  Verifying  : marathon-1.3.0-1.0.506.el7.x86_64
                           1/1 \n\nFailed:\n  marathon.x86_64 0:1.3.0-1.0.506.el7        
                                  \n\nComplete!\n"]}
failed: [52.87.235.79] (item=[u'mesos', u'marathon']) => {"changed": true, "failed": true,
"item": ["mesos", "marathon"], "msg": "Error unpacking rpm package marathon-1.3.0-1.0.506.el7.x86_64\n",
"rc": 1, "results": ["All packages providing mesos are up to date", "Loaded plugins: fastestmirror\nLoading
mirror speeds from cached hostfile\n * base: mirrors.tripadvisor.com\n
* extras: mirrors.evowise.com\n * updates: mirrors.greenmountainaccess.net\nResolving
Dependencies\n--> Running transaction check\n---> Package marathon.x86_64 0:1.3.0-1.0.506.el7
will be installed\n--> Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n
Package         Arch          Version                  Repository         Size\n================================================================================\nInstalling:\n
marathon        x86_64        1.3.0-1.0.506.el7        mesosphere         17 M\n\nTransaction
Summary\n================================================================================\nInstall
 1 Package\n\nTotal download size: 17 M\nInstalled size: 89 M\nDownloading packages:\nRunning
transaction check\nRunning transaction test\nTransaction test succeeded\nRunning transaction\n
 Installing : marathon-1.3.0-1.0.506.el7.x86_64                            1/1 \nerror: unpacking
of archive failed on file /usr/bin/marathon;57daffff: cpio: read\n  Verifying  : marathon-1.3.0-1.0.506.el7.x86_64
                           1/1 \n\nFailed:\n  marathon.x86_64 0:1.3.0-1.0.506.el7        
                                  \n\nComplete!\n"]}
failed: [54.167.94.186] (item=[u'mesos', u'marathon']) => {"changed": true, "failed": true,
"item": ["mesos", "marathon"], "msg": "Error unpacking rpm package marathon-1.3.0-1.0.506.el7.x86_64\n",
"rc": 1, "results": ["All packages providing mesos are up to date", "Loaded plugins: fastestmirror\nLoading
mirror speeds from cached hostfile\n * base: mirrors.tripadvisor.com\n
* extras: mirrors.evowise.com\n * updates: mirrors.greenmountainaccess.net\nResolving
Dependencies\n--> Running transaction check\n---> Package marathon.x86_64 0:1.3.0-1.0.506.el7
will be installed\n--> Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n
Package         Arch          Version                  Repository         Size\n================================================================================\nInstalling:\n
marathon        x86_64        1.3.0-1.0.506.el7        mesosphere         17 M\n\nTransaction
Summary\n================================================================================\nInstall
 1 Package\n\nTotal download size: 17 M\nInstalled size: 89 M\nDownloading packages:\nRunning
transaction check\nRunning transaction test\nTransaction test succeeded\nRunning transaction\n
 Installing : marathon-1.3.0-1.0.506.el7.x86_64                            1/1 \nerror: unpacking
of archive failed on file /usr/bin/marathon;57daffff: cpio: read\n  Verifying  : marathon-1.3.0-1.0.506.el7.x86_64
                           1/1 \n\nFailed:\n  marathon.x86_64 0:1.3.0-1.0.506.el7        
                                  \n\nComplete!\n"]}

NO MORE HOSTS LEFT *************************************************************

RUNNING HANDLER [zookeeper : restart zookeeper] ********************************
[WARNING]: Could not create retry file 'site.retry'.         [Errno 2] No such file or directory:
''


PLAY RECAP *********************************************************************
52.87.235.79               : ok=17   changed=2    unreachable=0    failed=1
52.91.152.1                : ok=17   changed=2    unreachable=0    failed=1
54.167.94.186              : ok=17   changed=2    unreachable=0    failed=1
localhost                  : ok=1    changed=0    unreachable=0    failed=0

Is there some step that I am missing? It looks like the instances are not able to communicate
because of the firewall? This is just a wild guess. Any help here is appreciated.

Thanks and Regards,
Gourav Shenoy

From: Shameera Rathnayaka <shameerainfo@gmail.com>
Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
Date: Monday, September 12, 2016 at 11:19 AM
To: dev <dev@airavata.apache.org>
Subject: Spinup Mesos-Marathon Cluster for Hybrid Scheduling

Hi Dev,

As part of the effort to use cloud infrastructure to run MPI and big data jobs with Airavata,
we use Apache Mesos as the resource allocation framework to manage different types of clusters
(i.e., HPC node clusters to run MPI jobs, and Spark and Hadoop clusters to run big data applications).
I came up with an Ansible script to spin up a Mesos cluster on a target set of nodes. You can
find the script here: https://github.com/shamrath/mesos-deployment. I am thinking of moving this
code to Airavata if all agree. I would be happy to answer any questions related to this.

Thanks,
Shameera.
--
Shameera Rathnayaka