MIME-Version: 1.0
From: Josh Wills
Date: Mon, 12 Oct 2015 15:57:41 -0700
Subject: Re: crunch planner parameters
To: "user@crunch.apache.org"
Reply-To: user@crunch.apache.org
Content-Type: multipart/alternative; boundary=001a11c30408d70fde0521f042f8

--001a11c30408d70fde0521f042f8
Content-Type: text/plain; charset=UTF-8

It is the latter approach, yes. The former would be better.

J

On Mon, Oct 12, 2015 at 3:56 PM, Everett Anderson wrote:

> Hey Josh,
>
> Somewhat related question -- when computing the number of reducers, is the
> planner doing that at the start of each MR job, estimating the size of the
> map output and then calculating number of reducers based on the input data
> size going into the job?
>
> Or does it make the calculation at the very beginning of the pipeline
> after reading the sources?
>
> The former might be more accurate, with the latter suffering a compounding
> effect from poor estimation at any step.
>
> On Mon, Oct 12, 2015 at 3:46 PM, Josh Wills wrote:
>
>> No, just the number of tasks involved in each job. The structure should
>> remain the same.
>>
>> J
>>
>> On Mon, Oct 12, 2015 at 3:44 PM, Ravi Kolluri wrote:
>>
>>> Thanks Josh!
>>>
>>> My question was more about how the planner organizes the map-reduce
>>> computation. Would the crunch job composition change based on input size?
>>>
>>> thanks,
>>> Ravi
>>>
>>> On Mon, Oct 12, 2015 at 3:38 PM, Josh Wills wrote:
>>>
>>>> Hey Ravi,
>>>>
>>>> The number of reducers used in the various stages of the MR job can
>>>> change if you don't hard-code them using groupByKey(int numReducers) or
>>>> groupByKey(GroupingOptions) (or the equivalent settings via the
>>>> JoinStrategy classes for joins). The planner will try to estimate the
>>>> number of bytes to be processed and aims to process 1GB of data per
>>>> reducer. If you do hard-code the number of reduce tasks, the planner will
>>>> respect your wishes no matter what the input size is.
>>>>
>>>> Josh
>>>>
>>>> On Mon, Oct 12, 2015 at 2:31 PM, Ravi Kolluri wrote:
>>>>
>>>>> Hello Crunch users,
>>>>>
>>>>> I have a question about what parameters go into the Crunch planner.
>>>>>
>>>>> Let's say I have a crunch job with a set of input tables, and a fixed
>>>>> set of calls to parallelDo and groupBy operations. Does the crunch
>>>>> execution plan stay fixed independent of the size distribution of the
>>>>> inputs?
>>>>>
>>>>> thanks,
>>>>> Ravi
>>>>>
>>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>>> may contain information that is confidential, proprietary in nature,
>>>>> protected health information (PHI), or otherwise protected by law from
>>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>>> are not the intended recipient, you are hereby notified that any use,
>>>>> disclosure or copying of this email, including any attachments, is
>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>> error, please notify the sender of this email. Please delete this and all
>>>>> copies of this email from your system. Any opinions either expressed or
>>>>> implied in this email and all attachments, are those of its author only,
>>>>> and do not necessarily reflect those of Nuna Health, Inc.
--001a11c30408d70fde0521f042f8--
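
The thread says the planner aims for about 1GB of data per reducer and sizes everything once at the start of the pipeline, which Everett notes can compound estimation errors. A minimal sketch of that arithmetic follows. It is not Crunch's planner code: the class ReducerEstimate, the methods reducersFor and propagate, the ceiling-division rounding, the choice of 2^30 bytes for "1GB", and the per-stage size ratios are all hypothetical and chosen only for illustration.

```java
// Hypothetical illustration only; none of these names come from Apache Crunch.
final class ReducerEstimate {
    // Assumption: "1GB per reducer" taken as 2^30 bytes.
    static final long BYTES_PER_REDUCER = 1L << 30;

    /** Reducer count for an estimated byte count: ceiling division, never below one. */
    static int reducersFor(long estimatedBytes) {
        if (estimatedBytes <= 0) {
            return 1;
        }
        long n = (estimatedBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER;
        return (int) Math.min(n, Integer.MAX_VALUE);
    }

    /**
     * Size estimate made once, up front: the source size multiplied by an
     * assumed output/input ratio for every stage that precedes the shuffle.
     */
    static long propagate(long sourceBytes, double... stageRatios) {
        double bytes = sourceBytes;
        for (double ratio : stageRatios) {
            bytes *= ratio;
        }
        return (long) bytes;
    }

    public static void main(String[] args) {
        long source = 10L * BYTES_PER_REDUCER; // 10 GB of input

        // Up-front guess: both earlier stages are assumed to preserve size.
        long planned = propagate(source, 1.0, 1.0);
        // What actually happens in this example: each stage halves the data.
        long actual = propagate(source, 0.5, 0.5);

        System.out.println(reducersFor(planned)); // prints 10
        System.out.println(reducersFor(actual));  // prints 3
    }
}
```

In this made-up case two modest per-stage misestimates leave the job with 10 reducers where 3 would have matched the target, which is the compounding effect raised in the thread. Re-estimating at the start of each MR job would amount to calling reducersFor on that job's measured input size instead of on the propagated guess.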