crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-296) Support new distributed execution engines (e.g., Spark)
Date Tue, 10 Dec 2013 14:01:09 GMT


Gabriel Reid commented on CRUNCH-296:

Looks good to me in general, although I'm running into one small issue -- it's not compiling
for me directly under maven (jdk 1.7.0_40 on Mac OS X), although it is fine compiling within

The issue is line 42 in Replacing that line with 
    return (CombineFn) fn;
resolves the issue for me, although looking at that change that's required to get it to compile
and the generics info that's being thrown away, I wonder if there is something else to worry
about there.

At first I wasn't so sure about the new cache() methods on PCollection, but thinking about
it more I think it's actually even a more logical naming for the way that materialize() is
currently used to cache results of a computation on disk, so I'm all for it.

> Support new distributed execution engines (e.g., Spark)
> -------------------------------------------------------
>                 Key: CRUNCH-296
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-296.patch, CRUNCH-296b.patch, CRUNCH-296c.patch, CRUNCH-296d.patch,
> I've been working on this off-and-on for awhile, but it's currently in a state where
I feel like it's worth sharing: I came up with an implementation of the Crunch APIs that runs
on top of Apache Spark instead of MapReduce.
> My goal for this is pretty simple; I want to be able to change any instances of "new
MRPipeline(...)" to "new SparkPipeline(...)", not change anything else at all, and have my
pipelines run on Spark instead of as a series of MR jobs. Turns out that we can pretty much
do exactly that. Not everything works yet, but lots of things do-- joins and cogroups work,
the PageRank and TfIdf integration tests work. Some things that do not work that I'm aware
of: in-memory joins and some of the more complex file output handling rules, but I believe
that these things are fixable. Some thing that might work or might not: HBase inputs and outputs
on top of Spark.
> This is just an idea I had, and I would understand if other people don't want to work
on this or don't think it's the right direction for the project. My minimal request would
be to include the refactoring of the core APIs necessary to support plugging in new execution
frameworks so I can keep working on this stuff.

This message was sent by Atlassian JIRA

View raw message