spark-issues mailing list archives

From "Florian Verhein (JIRA)" <>
Subject [jira] [Commented] (SPARK-5552) Automated data science AMI creation and data science cluster deployment on EC2
Date Wed, 04 Feb 2015 00:54:34 GMT


Florian Verhein commented on SPARK-5552:

Thanks [~sowen]. 

So it wouldn't fit in the spark repo itself (the only change there would be to add an option
to use an alternate spark-ec2 repo/branch). It would naturally live in spark-ec2,
as it involves changes to spark-ec2 for both use cases:
- Image creation is based on the work soon to be added to spark-ec2 for this:
- Cluster deployment+configuration is done using the spark-ec2 scripts themselves (but with
many modifications/fixes).
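
To illustrate the one change that would be needed on the spark side, here is a minimal sketch of a launcher option for pointing at an alternate spark-ec2 repo/branch. Everything in it is an assumption made for illustration: the flag names, the placeholder default URL and branch, and the helper `clone_command` are hypothetical and are not taken from the existing spark-ec2 code.

```python
import argparse

# Hypothetical defaults for illustration only; not the real spark-ec2 location.
DEFAULT_REPO = "https://example.com/spark-ec2.git"
DEFAULT_BRANCH = "master"


def build_parser():
    """Build a parser exposing the (assumed) repo/branch options."""
    parser = argparse.ArgumentParser(description="Sketch of alternate spark-ec2 source options")
    parser.add_argument(
        "--spark-ec2-git-repo",
        default=DEFAULT_REPO,
        help="Git repo to fetch the cluster setup scripts from",
    )
    parser.add_argument(
        "--spark-ec2-git-branch",
        default=DEFAULT_BRANCH,
        help="Branch of that repo to check out",
    )
    return parser


def clone_command(repo, branch):
    """Return the git command a launcher could run to fetch the scripts."""
    return ["git", "clone", "--branch", branch, "--single-branch", repo, "spark-ec2"]


# Example: keep the default repo but select a custom branch.
args = build_parser().parse_args(["--spark-ec2-git-branch", "datascience"])
print(" ".join(clone_command(args.spark_ec2_git_repo, args.spark_ec2_git_branch)))
```

With options along these lines, the image-creation and deployment changes could stay entirely in a spark-ec2 fork or branch.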

Since there is a dependency between the image and the configuration ( and
scripts, it's not possible to solve this with just an AMI.

The extra components (actually, just vowpal wabbit and more python libraries - the rest already
exist in the spark-ec2 AMI) are added to the image purely for data science convenience.

> Automated data science AMI creation and data science cluster deployment on EC2
> ------------------------------------------------------------------------------
>                 Key: SPARK-5552
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: EC2
>            Reporter: Florian Verhein
> Issue created RE: (please read for background)
> Goal:
> Extend spark-ec2 scripts to create an automated data science cluster deployment on EC2, suitable for almost(?)-production use.
> Use cases: 
> - A user can build their own custom data science AMIs from a CentOS minimal image by calling a packer configuration (good defaults should be provided, some options for flexibility)
> - A user can then easily deploy a new (correctly configured) cluster using these AMIs, and do so as quickly as possible.
> Components/modules: Spark + tachyon + hdfs (on instance storage) + python + R + vowpal wabbit + any rpms + ... + ganglia
> Focus is on reliability (rather than e.g. supporting many versions / dev testing) and speed of deployment.
> Use hadoop 2 so option to lift into yarn later.
> My current solution is here:
> It includes other fixes/improvements as needed to get it working.
> Now that it seems to work (but has deviated a lot more from the existing code base than I was expecting), I'm wondering what to do with it...
> Keen to hear ideas if anyone is interested. 

