spark-issues mailing list archives

From "Florian Verhein (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-5552) Automated data science AMI creation and data science cluster deployment on EC2
Date Wed, 04 Feb 2015 00:54:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304412#comment-14304412 ]

Florian Verhein commented on SPARK-5552:
----------------------------------------

Thanks [~sowen]. 

So it wouldn't fit in the Spark repo itself (the only change there would be to add an option
in spark_ec2.py to use an alternate spark-ec2 repo/branch). It would naturally live in spark-ec2,
as it involves changes to spark-ec2 for both use cases:
- Image creation is based on the work soon to be added to spark-ec2 for this: https://issues.apache.org/jira/browse/SPARK-3821
- Cluster deployment and configuration are done using the spark-ec2 scripts themselves (but with
many modifications/fixes).
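
As a rough illustration of the one change that would touch the Spark repo, here is a minimal sketch of what such an option could look like. The flag names, the default branch, the clone target directory, and both helper functions are hypothetical and chosen only for illustration; this is not the actual spark_ec2.py code.

```python
import argparse


def parse_args(argv):
    """Parse the two hypothetical options for selecting a spark-ec2 source."""
    parser = argparse.ArgumentParser(description="sketch: alternate spark-ec2 repo/branch")
    # Assumption: flag names are illustrative, not taken from spark_ec2.py.
    parser.add_argument(
        "--spark-ec2-git-repo",
        default="https://github.com/mesos/spark-ec2",
        help="Git repo to fetch the spark-ec2 scripts from")
    # Assumption: "master" as the default branch is a guess.
    parser.add_argument(
        "--spark-ec2-git-branch",
        default="master",
        help="Branch of the spark-ec2 repo to check out")
    return parser.parse_args(argv)


def clone_command(repo, branch, target="spark-ec2"):
    """Build the git command that would fetch the chosen repo and branch."""
    return ["git", "clone", "--branch", branch, "--single-branch", repo, target]


if __name__ == "__main__":
    # Example: point at the packer branch of the fork mentioned in this issue.
    args = parse_args([
        "--spark-ec2-git-repo", "https://github.com/florianverhein/spark-ec2",
        "--spark-ec2-git-branch", "packer",
    ])
    print(" ".join(clone_command(args.spark_ec2_git_repo, args.spark_ec2_git_branch)))
```

With defaults like these, existing users would see no change, while anyone wanting the data science setup would only pass the two extra options.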

Since there is a dependency between the image and the configuration (init.sh and setup.sh)
scripts, it's not possible to solve this with just an AMI.

The extra components (actually, just vowpal wabbit and more python libraries; the rest already
exist in the spark-ec2 AMI) are added to the image purely for data science convenience.


> Automated data science AMI creation and data science cluster deployment on EC2
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-5552
>                 URL: https://issues.apache.org/jira/browse/SPARK-5552
>             Project: Spark
>          Issue Type: New Feature
>          Components: EC2
>            Reporter: Florian Verhein
>
> Issue created RE: https://github.com/mesos/spark-ec2/pull/90#issuecomment-72597154 (please read for background)
> Goal:
> Extend spark-ec2 scripts to create an automated data science cluster deployment on EC2, suitable for almost(?)-production use.
> Use cases:
> - A user can build their own custom data science AMIs from a CentOS minimal image by calling a packer configuration (good defaults should be provided, some options for flexibility)
> - A user can then easily deploy a new (correctly configured) cluster using these AMIs, and do so as quickly as possible.
> Components/modules: Spark + tachyon + hdfs (on instance storage) + python + R + vowpal wabbit + any rpms + ... + ganglia
> Focus is on reliability (rather than e.g. supporting many versions / dev testing) and speed of deployment.
> Use hadoop 2 so option to lift into yarn later.
> My current solution is here: https://github.com/florianverhein/spark-ec2/tree/packer. It includes other fixes/improvements as needed to get it working.
> Now that it seems to work (but has deviated a lot more from the existing code base than I was expecting), I'm wondering what to do with it...
> Keen to hear ideas if anyone is interested.




