spark-dev mailing list archives

From Matt Goodman
Subject Re: Should spark-ec2 get its own repo?
Date Sat, 11 Jul 2015 18:07:12 GMT
I wanted to revive the conversation about the spark-ec2 tools, as it seems
to have been lost in the 1.4.1 release voting spree.

I think that splitting it into its own repository is a really good move,
and I would also be happy to help with this transition, as well as help
maintain the resulting repository.  Here is my justification for why we
ought to do this split.

User Facing:

   - The spark-ec2 launcher doesn't use anything in the parent spark repo.
   - The spark-ec2 version is disjoint from the parent repo.  I consider it
   confusing that the spark-ec2 script doesn't launch the version of spark it
   is checked out with.
   - Someone interested in setting up spark-ec2 with anything but the
   default configuration will have to clone at least 2 repositories at
   present, and probably fork and push changes to 1.
   - spark-ec2 has mismatched dependencies with respect to spark itself.  This
   includes a confusing shim in the spark-ec2 script to install boto, which
   frankly should just be a dependency of the script.
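
To make the boto point concrete, here is a minimal sketch of what the script
could do once boto is a declared dependency: instead of installing the
library itself, it only checks that the dependency is importable and fails
with a clear hint.  The helper name require_module and the error wording are
hypothetical, not taken from the current spark-ec2 script.

```python
import importlib.util


def require_module(module_name):
    """Raise ImportError with an install hint if module_name is missing.

    Hypothetical helper: it assumes the dependency is declared in the
    package metadata, so the script never has to install it on its own.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            "Missing dependency %r; install it with: pip install %s"
            % (module_name, module_name)
        )


# In the launcher this would be called once at startup, e.g.:
#     require_module("boto")
```

With the dependency declared in one place, the launcher can drop the install
logic entirely and a missing library becomes an ordinary packaging problem.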

Developer Facing:

   - Support across 2 repos will be worse than across 1.  It's unclear where
   to file issues/PRs, and even fairly trivial changes require extra
   communication.
   - spark-ec2 also depends on a number of binary blobs being in the right
   place.  Currently the responsibility for these is decentralized, and likely
   prone to various flavors of dumb.
   - The current flow of booting a spark-ec2 cluster is _complicated_.  I
   spent the better part of a couple of days figuring out how to integrate our
   custom tools into this stack.  This is very hard to fix when commits/PRs
   need to span groups/repositories/buckets-o-binary.  I am sure there are
   several other problems that are languishing under similar roadblocks.
   - A split makes testing possible.  The spark-ec2 script is a great case for
   CI given the number of permutations of launch criteria there are.  I
   suspect AWS would be happy to foot the bill for spark-ec2 testing (probably
   ~20 bucks a month based on some envelope sketches), as it is a piece of
   software that directly leads to other people giving them money.  I have some
   contacts there, and I am pretty sure this would be an easy conversation,
   particularly if the repo is directly concerned with ec2.  Think also of
   being able to assemble the binary blobs into an s3 bucket dedicated to
   spark-ec2.
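
As a rough illustration of the CI point, the permutations of launch criteria
can be enumerated mechanically and each combination handed to a test job.
This is only a sketch: the option names and values below are placeholders
chosen for illustration, not a list of what spark-ec2 actually supports.

```python
from itertools import product

# Hypothetical launch criteria; names and values are illustrative only.
LAUNCH_CRITERIA = {
    "region": ["us-east-1", "us-west-2"],
    "instance_type": ["m3.large", "r3.xlarge"],
    "spark_version": ["1.3.1", "1.4.0"],
}


def launch_matrix(criteria):
    """Return one dict per combination of the given launch options."""
    names = sorted(criteria)
    value_lists = [criteria[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


matrix = launch_matrix(LAUNCH_CRITERIA)
print(len(matrix))  # 2 * 2 * 2 = 8 combinations
```

The matrix grows multiplicatively with each new option, which is why this
kind of coverage is a better fit for automated CI than for manual testing.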

Any other thoughts/voices appreciated here.  spark-ec2 is a super-power
tool and deserves a fair bit of attention!
--Matthew Goodman
