From: Micah Whitacre
To: user@crunch.apache.org
Reply-To: user@crunch.apache.org
Date: Tue, 3 Nov 2015 10:54:47 -0600
Subject: Re: Understanding ScaleFactor in Crunch

There are some details on scaleFactor here.[1] Essentially, Crunch uses it
for a couple of things:

1. Calculating the number of reducers to use when grouping and nothing is
   specified.
2. Optimizing to decrease how much I/O it has to do, if possible.

In the latter case, if your pipeline might require state to be persisted,
Crunch will try to optimize to do the least amount of I/O at the cost of
doing recalculations. So it might persist to disk right before your DoFn if
the DoFn creates significantly more data. If you know the data increase is
that significant, it would definitely be advisable to override the method [2]
and give a more reasonable factor value.

>> Can I tell it to leverage more mappers/reducers in the DoFn?

The scale factor applies to the number of reducers but shouldn't affect the
number of mappers, which is controlled by the input splits.

[1] - http://crunch.apache.org/user-guide.html#doplan
[2] - https://github.com/apache/crunch/blob/188360048a7f2d3cedf5fc915b48d7671f1d8d46/crunch-core/src/main/java/org/apache/crunch/DoFn.java#L132

On Tue, Nov 3, 2015 at 10:38 AM, Robinson, Landon - Landon <landon.t.robinson@lowes.com> wrote:

> All,
>
> I’m trying to understand how I might use scaleFactor() in my Crunch code.
> My use case is this: I have data that I read into a PCollection that is
> smaller than my system’s block size, but when processed in a DoFn, *grows*
> pretty exponentially.
>
> So what started as a 10 MB file might become 10 times larger.
>
> To prevent spills and memory issues, how could I leverage something like
> scaleFactor() (or whatever is needed) to indicate to the Crunch Planner
> that my resulting PCollection will grow exponentially?
> Can I tell it to leverage more mappers/reducers in the DoFn?
>
> Guidance, if you could!
>
> Thanks,
> Landon
> ---------------------------------------------------------------------------
>
> Landon Robinson
> ---------------------------------------------------------------------------
> NOTICE: All information in and attached to the e-mails below may be
> proprietary, confidential, privileged and otherwise protected from improper
> or erroneous disclosure. If you are not the sender's intended recipient,
> you are not authorized to intercept, read, print, retain, copy, forward, or
> disseminate this message. If you have erroneously received this
> communication, please notify the sender immediately by phone (704-758-1000)
> or by e-mail and destroy all copies of this message electronic, paper, or
> otherwise.
>
> *By transmitting documents via this email: Users, Customers, Suppliers and
> Vendors collectively acknowledge and agree the transmittal of information
> via email is voluntary, is offered as a convenience, and is not a secured
> method of communication; Not to transmit any payment information E.G.
> credit card, debit card, checking account, wire transfer information,
> passwords, or sensitive and personal information E.G. Driver's license,
> DOB, social security, or any other information the user wishes to remain
> confidential; To transmit only non-confidential information such as plans,
> pictures and drawings and to assume all risk and liability for and
> indemnify Lowe's from any claims, losses or damages that may arise from the
> transmittal of documents or including non-confidential information in the
> body of an email transmittal. Thank you.*
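To make the reducer arithmetic from the reply concrete, here is a minimal, self-contained sketch of how a scale factor could feed a reducer estimate. It deliberately does not use Crunch's own classes: SketchFn, ExpandingFn, ScaleFactorSketch, the 1 GB per-reducer target, and the "1 + size / target" formula are all assumptions made for illustration and should not be read as the planner's actual logic. In real code, the equivalent step would be overriding scaleFactor() on your own DoFn subclass, as suggested above.

```java
/** Illustrative stand-in for a function type; NOT a Crunch class. */
abstract class SketchFn {
    /** Estimated ratio of output size to input size. */
    float scaleFactor() {
        return 1.0f; // hypothetical default, chosen only for this sketch
    }
}

/** A function assumed to emit roughly ten times as much data as it reads. */
final class ExpandingFn extends SketchFn {
    @Override
    float scaleFactor() {
        return 10.0f;
    }
}

final class ScaleFactorSketch {
    // Hypothetical target amount of data per reducer (1 GB); an assumption.
    static final long BYTES_PER_REDUCER = 1_000_000_000L;

    /** Scales the input size by the function's declared factor. */
    static long estimateOutputBytes(long inputBytes, SketchFn fn) {
        return (long) (inputBytes * (double) fn.scaleFactor());
    }

    /** Assumed rule: one reducer, plus one more per full target-sized chunk. */
    static int estimateReducers(long estimatedBytes, long bytesPerReducer) {
        return 1 + (int) (estimatedBytes / bytesPerReducer);
    }

    public static void main(String[] args) {
        long inputBytes = 10_000_000L; // the 10 MB file from the question
        long outputBytes = estimateOutputBytes(inputBytes, new ExpandingFn());
        System.out.println("estimated output bytes: " + outputBytes);
        System.out.println("reducers at 1 GB each:  "
                + estimateReducers(outputBytes, BYTES_PER_REDUCER));
        System.out.println("reducers at 32 MB each: "
                + estimateReducers(outputBytes, 32_000_000L));
    }
}
```

Under these assumptions, a 10 MB input with a factor of 10 is estimated at 100 MB, which still maps to a single reducer when the per-reducer target is 1 GB. For data this small, a larger scale factor alone may not add parallelism, and specifying the number of reducers when grouping may be the more direct lever.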