From: Josh Wills
Date: Mon, 24 Sep 2012 13:34:02 -0700
Subject: Supporting legacy Mapper and Reducer classes in Crunch
To: crunch-dev@incubator.apache.org

One of the ideas that Gabriel mentioned on our last epic architecture
thread has stuck w/me, and that was adding support for using a
pre-existing Mapper and Reducer class on the Crunch APIs, so that you
could do something like:

    pipeline.read(From.tableSource(...))
        .parallelDo(new SomeDoFn(), ...)
        .parallelDo(mapperFn(Mapper.class), ...)
        .groupByKey()
        .parallelDo(reducerFn(Reducer.class), ...)
        .parallelDo(new OtherDoFn(), ...)
        .write(To.tableTarget(...));

This turns out to be kind of tricky to do no matter how we approach the
problem, because for this to work, we'll need to (at a minimum) subclass
the Mapper.Context and Reducer.Context classes that are passed to the
Mapper and Reducer instances, and they have different implementations
(most importantly for our purposes, different constructors) under
Hadoop 1 and 2.

It feels to me that what I need to do is create a separate subproject
that has to do some crazy stuff (e.g., use different source directories
depending on the value of the crunch.platform variable) in order to be
able to create the appropriate kind of subclass of Mapper.Context or
Reducer.Context. But this sort of thing seems like such a bad idea that
there must be some sort of less-bad option available to me, and I wanted
to solicit input before I start tilting at this particular windmill.

Thanks!
Josh

-- 
Director of Data Science
Cloudera
Twitter: @josh_wills
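
For readers following the mapperFn idea, here is a minimal Java sketch of the adapter shape it implies. Everything in it is a hypothetical stand-in written for illustration: Emitter, Pair, Context, LegacyMapper, Fn, and this mapperFn are not the Hadoop or Crunch types, and the signature of mapperFn is an assumption. In particular, Context is modeled here as an interface, which sidesteps exactly the difficulty the message describes: the real Mapper.Context and Reducer.Context have to be subclassed, and their constructors differ between Hadoop 1 and 2.

```java
import java.util.ArrayList;
import java.util.List;

public class LegacyMapperSketch {

    // Hypothetical stand-in for a Crunch-style output sink.
    interface Emitter<T> { void emit(T value); }

    // Hypothetical key/value holder.
    static final class Pair<K, V> {
        final K key;
        final V value;
        Pair(K key, V value) { this.key = key; this.value = value; }
    }

    // Hypothetical stand-in for Mapper.Context. Modeled as an interface,
    // so no version-specific constructor is needed in this sketch.
    interface Context<K, V> { void write(K key, V value); }

    // Hypothetical stand-in for a legacy mapper that writes via a context.
    interface LegacyMapper<KI, VI, KO, VO> {
        void map(KI key, VI value, Context<KO, VO> context);
    }

    // Hypothetical stand-in for a DoFn-like processing function.
    interface Fn<S, T> { void process(S input, Emitter<T> emitter); }

    // Adapter: forwards every context.write(k, v) to the emitter as a Pair.
    static <KI, VI, KO, VO> Fn<Pair<KI, VI>, Pair<KO, VO>> mapperFn(
            LegacyMapper<KI, VI, KO, VO> mapper) {
        return (input, emitter) ->
            mapper.map(input.key, input.value,
                       (k, v) -> emitter.emit(new Pair<>(k, v)));
    }

    // Runs a toy word-splitting mapper through the adapter.
    static List<String> demo(String line) {
        LegacyMapper<Long, String, String, Integer> tokenizer =
            (offset, text, ctx) -> {
                for (String word : text.split(" ")) {
                    ctx.write(word, 1);
                }
            };
        List<String> out = new ArrayList<>();
        mapperFn(tokenizer).process(new Pair<>(0L, line),
                                    p -> out.add(p.key + "=" + p.value));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo("to be or")); // prints [to=1, be=1, or=1]
    }
}
```

The adapter itself is a few lines; the hard part in the real setting is constructing the context object, which is why the message focuses on the Context subclasses rather than on the wrapping function.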