From: Narlin M
To: Crunch users
Reply-To: user@crunch.apache.org
Date: Thu, 15 Aug 2013 08:54:50 -0500
Subject: Re: Crunch DoFn vs Mapper/reducer

Thanks for the reply, Josh. I understand its function a bit better now.

On Wed, Aug 14, 2013 at 5:50 PM, Josh Wills wrote:

> Hey Narlin,
>
> DoFns are similar to the Mapper and Reducer classes that you would write
> in classic MapReduce jobs-- they don't spawn MapReduce jobs themselves. The
> Crunch planner will analyze the overall DAG of DoFns, groupByKeys, unions,
> and combineValues operations and compile the DAG into one or more MapReduce
> jobs, where each of the DoFns will be assigned to one of the Mappers or
> Reducers in those jobs. Crunch has its own Mapper and Reducer
> implementations (named CrunchMapper and CrunchReducer, naturally) that are
> responsible for executing the DoFns that are assigned to each phase of the
> job.
>
> In general, you should not need to use mapper and reducer classes when you
> use Crunch, although if you have legacy Mapper and Reducer classes that you
> would like to use in conjunction with the DoFns in a Crunch pipeline, there
> is a collection of methods in org.apache.crunch.lib.MapReduce in Crunch
> 0.7.0 that will wrap a given Mapper or Reducer class inside of a DoFn.
>
> Hope that helps.
>
> Best,
> Josh
>
> On Wed, Aug 14, 2013 at 12:59 PM, Narlin M wrote:
>
>> I have just recently started using Crunch, having been recommended to use
>> it instead of writing plain map reduce jobs. As I was going through the
>> crunch documentation, some questions came to my mind. Am I correct in
>> saying that the DoFn family of functions will internally spawn map-reduce
>> jobs, so there is no need to write separate mapper or reducer classes? If
>> so, I agree that this will abstract some of the lower level details from
>> the programmer, but at the same time, does it not lower the programmer's
>> control over the processing logic?
>>
>> Also, will there be situations when separate mapper / reducer classes
>> will be required in addition to the DoFn functions?
>>
>> Thanks.
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
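
To make Josh's point concrete, here is a toy sketch of the idea that DoFns are only steps in a logical pipeline and that a planner decides which map or reduce phase runs each one. It does not use Crunch and is not how the real Crunch planner is implemented: ToyPlanner, Job, GROUP_BY_KEY, and the step names are all hypothetical, and the rule it applies (functions before a groupByKey run in the map phase, functions after it run in the reduce phase, and a second groupByKey starts a new job) is a deliberate simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only: none of these types are part of Apache Crunch. */
public class ToyPlanner {

    /** Hypothetical marker for a groupByKey step in the logical pipeline. */
    public static final String GROUP_BY_KEY = "groupByKey";

    /** One MapReduce job: a map phase, an optional shuffle, and a reduce phase. */
    public static final class Job {
        public final List<String> mapFns = new ArrayList<>();
        public final List<String> reduceFns = new ArrayList<>();
        public boolean hasShuffle = false;

        @Override
        public String toString() {
            return "Job{map=" + mapFns + ", shuffle=" + hasShuffle
                    + ", reduce=" + reduceFns + "}";
        }
    }

    /**
     * Assigns each named function to a phase of some job.
     * Simplifying assumption: a job contains at most one shuffle, so a
     * groupByKey that follows an earlier one opens a new job whose map
     * phase is empty (think of it as an identity map over the previous
     * job's output).
     */
    public static List<Job> plan(List<String> steps) {
        List<Job> jobs = new ArrayList<>();
        Job current = new Job();
        jobs.add(current);
        for (String step : steps) {
            if (GROUP_BY_KEY.equals(step)) {
                if (current.hasShuffle) {
                    // The current job already shuffles once; start another job.
                    current = new Job();
                    jobs.add(current);
                }
                current.hasShuffle = true;
            } else if (current.hasShuffle) {
                // After the shuffle, functions run on the reduce side.
                current.reduceFns.add(step);
            } else {
                // Before any shuffle, functions run on the map side.
                current.mapFns.add(step);
            }
        }
        return jobs;
    }

    public static void main(String[] args) {
        // Hypothetical word-count-like pipeline with two grouping steps.
        List<String> pipeline = Arrays.asList(
                "parseLine", "extractWord", GROUP_BY_KEY,
                "sumCounts", GROUP_BY_KEY, "formatOutput");
        for (Job job : plan(pipeline)) {
            System.out.println(job);
        }
    }
}
```

In this toy, the six logical steps become two jobs: parseLine and extractWord share the first job's map phase, sumCounts runs in its reduce phase, and formatOutput runs in the second job's reduce phase. None of the functions launched a job on its own, which is the distinction Josh draws between a DoFn and a MapReduce job.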