crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Created] (CRUNCH-296) Support new distributed execution engines (e.g., Spark)
Date Mon, 18 Nov 2013 07:15:21 GMT
Josh Wills created CRUNCH-296:

             Summary: Support new distributed execution engines (e.g., Spark)
                 Key: CRUNCH-296
             Project: Crunch
          Issue Type: Improvement
          Components: Core
            Reporter: Josh Wills
            Assignee: Josh Wills

I've been working on this off and on for a while, and it's now in a state where I feel
like it's worth sharing: I came up with an implementation of the Crunch APIs that runs on
top of Apache Spark instead of MapReduce.

My goal for this is pretty simple: I want to be able to change any instances of "new MRPipeline(...)"
to "new SparkPipeline(...)", change nothing else, and have my pipelines run on
Spark instead of as a series of MR jobs. It turns out that we can pretty much do exactly that.
Not everything works yet, but lots of things do: joins and cogroups work, and the PageRank and
TfIdf integration tests pass. Things that I know do not work yet: in-memory joins
and some of the more complex file output handling rules, but I believe that these are
fixable. Something that might or might not work: HBase inputs and outputs on top of Spark.
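
To make the "change one constructor call" goal concrete, here is a minimal, self-contained sketch of the pattern. Everything in it is a toy stand-in invented for illustration: the Pipeline interface, its mapAll method, ToyMRPipeline, ToySparkPipeline and shout are hypothetical and are not the real Crunch or Spark classes. It only shows that user logic written against a shared interface does not change when the engine behind it does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class EngineSwapSketch {

    /** Hypothetical engine-neutral API; a toy stand-in, not the real Crunch Pipeline. */
    interface Pipeline {
        List<String> mapAll(List<String> input, Function<String, String> fn);
    }

    /** Toy stand-in for an MR-backed pipeline: applies fn in an explicit loop. */
    static class ToyMRPipeline implements Pipeline {
        @Override
        public List<String> mapAll(List<String> input, Function<String, String> fn) {
            List<String> out = new ArrayList<>();
            for (String s : input) {
                out.add(fn.apply(s));
            }
            return out;
        }
    }

    /** Toy stand-in for a Spark-backed pipeline: applies fn through a stream. */
    static class ToySparkPipeline implements Pipeline {
        @Override
        public List<String> mapAll(List<String> input, Function<String, String> fn) {
            return input.stream().map(fn).collect(Collectors.toList());
        }
    }

    /** User logic, written once against the interface and unaware of the engine. */
    static List<String> shout(Pipeline pipeline, List<String> lines) {
        return pipeline.mapAll(lines, String::toUpperCase);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello", "crunch");
        // The only line that changes when switching engines:
        Pipeline pipeline = new ToyMRPipeline(); // or: new ToySparkPipeline()
        System.out.println(shout(pipeline, lines));
    }
}
```

In this sketch the engine choice is confined to a single constructor call, which is the property the proposal aims for in real pipelines.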

This is just an idea I had, and I would understand if other people don't want to work on this
or don't think it's the right direction for the project. My minimal request would be to include
the refactoring of the core APIs necessary to support plugging in new execution frameworks
so I can keep working on this stuff.

This message was sent by Atlassian JIRA
