Date: Thu, 5 Dec 2013 21:16:36 +0000 (UTC)
From: "Josh Wills (JIRA)"
To: crunch-dev@incubator.apache.org
Reply-To: dev@crunch.apache.org
Subject: [jira] [Updated] (CRUNCH-296) Support new distributed execution engines (e.g., Spark)

     [ https://issues.apache.org/jira/browse/CRUNCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills updated CRUNCH-296:
------------------------------
    Attachment: CRUNCH-296d.patch

Adds support for requireSortedKeys() and the trick for applying combine functions in Spark that we discussed on the list. (Thanks to [~gabriel.reid] for that.)

> Support new distributed execution engines (e.g., Spark)
> -------------------------------------------------------
>
>                 Key: CRUNCH-296
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-296
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-296.patch, CRUNCH-296b.patch, CRUNCH-296c.patch, CRUNCH-296d.patch
>
>
> I've been working on this off and on for a while, and it's now in a state where I feel it's worth sharing: I came up with an implementation of the Crunch APIs that runs on top of Apache Spark instead of MapReduce.
> My goal for this is pretty simple: I want to be able to change any instances of "new MRPipeline(...)" to "new SparkPipeline(...)", not change anything else at all, and have my pipelines run on Spark instead of as a series of MR jobs. It turns out that we can pretty much do exactly that. Not everything works yet, but lots of things do: joins and cogroups work, and the PageRank and TfIdf integration tests work. Some things that I'm aware do not work: in-memory joins and some of the more complex file output handling rules, but I believe these are fixable. Something that might or might not work: HBase inputs and outputs on top of Spark.
> This is just an idea I had, and I would understand if other people don't want to work on this or don't think it's the right direction for the project. My minimal request would be to include the refactoring of the core APIs necessary to support plugging in new execution frameworks so I can keep working on this stuff.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
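
The engine swap described in the issue (change only the constructor call, leave the pipeline code alone) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: ToyPipeline, ToyMRPipeline, ToySparkPipeline, and EngineSwapDemo are hypothetical names, not Crunch or Spark classes, and the two toy engines only mimic the general shape of a phased batch job versus a single in-memory pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical engine-neutral interface; NOT the real Crunch Pipeline API.
interface ToyPipeline {
    // Counts occurrences of each whitespace-separated word in the input lines.
    Map<String, Long> wordCount(List<String> lines);
}

// Toy "MapReduce-style" engine: explicit map, sort, and reduce phases.
class ToyMRPipeline implements ToyPipeline {
    @Override
    public Map<String, Long> wordCount(List<String> lines) {
        // Map phase: emit one record per non-empty word.
        List<String> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    emitted.add(word);
                }
            }
        }
        // Shuffle phase: sort so that equal keys become adjacent.
        Collections.sort(emitted);
        // Reduce phase: sum one count per emitted record, keyed by word.
        Map<String, Long> counts = new TreeMap<>();
        for (String word : emitted) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }
}

// Toy "Spark-style" engine: a single chained in-memory pass over the data.
class ToySparkPipeline implements ToyPipeline {
    @Override
    public Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }
}

class EngineSwapDemo {
    // User code depends only on the interface, so the engine is chosen by the caller.
    static Map<String, Long> run(ToyPipeline pipeline) {
        return pipeline.wordCount(Arrays.asList("a b a", "b c"));
    }

    public static void main(String[] args) {
        // Swapping engines changes only the constructor call.
        System.out.println(run(new ToyMRPipeline()));
        System.out.println(run(new ToySparkPipeline()));
    }
}
```

The point of the sketch is the design choice the issue asks for: if pipeline code is written against an engine-neutral interface, a new execution framework can be plugged in without touching that code. How the actual patch structures this in Crunch's core APIs is not shown here.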