crunch-user mailing list archives

From Micah Whitacre <mkwhita...@gmail.com>
Subject Re: Understanding ScaleFactor in Crunch
Date Tue, 03 Nov 2015 16:54:47 GMT
There are some details on scaleFactor here [1].  Essentially, Crunch uses
it for a couple of things:

1. Calculating the number of reducers to use when grouping and nothing is
specified
2. Optimizing to decrease how much I/O it has to do if possible.

In the second case, if your pipeline requires intermediate state to be
persisted, Crunch will try to do the least amount of I/O, at the cost of
some recalculation.  So it might persist to disk right before your DoFn
if the DoFn produces significantly more data than it reads.  If you know
the data increase is that significant, it is definitely advisable to
override the method [2] and return a more realistic factor.
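
As a rough illustration of the override described above, here is a minimal
sketch.  It deliberately does not depend on Crunch: SketchFn is a
hypothetical stand-in for DoFn that only models the scaleFactor() hook, and
the 0.9f default and estimateOutputBytes() helper are assumptions made for
the example, not a statement of how the planner is implemented.  In a real
pipeline you would override scaleFactor() on your own DoFn subclass in the
same way.

```java
// Hypothetical stand-in for the relevant part of Crunch's DoFn, so that the
// sketch compiles without Crunch on the classpath.
abstract class SketchFn {
    // Mirrors the scaleFactor() hook: estimated ratio of output size to input size.
    float scaleFactor() {
        return 0.9f; // assumed default; check DoFn.java [2] for your version
    }
}

// A function whose output is roughly ten times the size of its input.
class ExpandingFn extends SketchFn {
    @Override
    float scaleFactor() {
        return 10.0f;
    }
}

public class ScaleFactorSketch {
    // Size estimate a planner could derive from the hint: input size times the factor.
    static long estimateOutputBytes(long inputBytes, SketchFn fn) {
        return (long) (inputBytes * (double) fn.scaleFactor());
    }

    public static void main(String[] args) {
        long input = 10L * 1024 * 1024; // the 10 MB file from the question
        System.out.println(estimateOutputBytes(input, new ExpandingFn()));
    }
}
```

The factor is only a hint about relative size; it does not change what the
function emits.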

>> Can I tell it to leverage more mappers/reducers in the DoFn?

The scale factor applies to the number of reducers but shouldn't affect
the number of mappers, as that is controlled by the input splits.
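
To make the reducer side concrete, here is a small sketch of the kind of
arithmetic involved: a size estimate divided by a per-reducer target,
rounded up.  The class, the method name and the 1 GB target are all
assumptions for illustration and are not taken from the Crunch source.  If
you would rather not rely on an estimate at all, you can specify the number
of partitions yourself when grouping (e.g. groupByKey(int)).

```java
public class ReducerEstimateSketch {
    // Assumed target amount of data per reducer; illustrative only.
    static final long BYTES_PER_REDUCER = 1024L * 1024 * 1024;

    // Ceiling division of the estimated size by the per-reducer target,
    // with a floor of one reducer.
    static int estimateReducers(long estimatedBytes, long bytesPerReducer) {
        long n = (estimatedBytes + bytesPerReducer - 1) / bytesPerReducer;
        return (int) Math.max(1L, n);
    }

    public static void main(String[] args) {
        long scaled = 10L * 1024 * 1024 * 10; // 10 MB input with a scale factor of 10
        System.out.println(estimateReducers(scaled, BYTES_PER_REDUCER));
        System.out.println(estimateReducers(scaled, 64L * 1024 * 1024));
    }
}
```

With a large per-reducer target, a 10 MB input scaled by 10 still fits in a
single reducer, so a bigger scale factor only changes the reducer count once
the estimated size crosses that target.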

[1] - http://crunch.apache.org/user-guide.html#doplan
[2] - https://github.com/apache/crunch/blob/188360048a7f2d3cedf5fc915b48d7671f1d8d46/crunch-core/src/main/java/org/apache/crunch/DoFn.java#L132

On Tue, Nov 3, 2015 at 10:38 AM, Robinson, Landon - Landon <
landon.t.robinson@lowes.com> wrote:

> All,
>
> I’m trying to understand how I might use scaleFactor() in my Crunch code.
> My use case is this: I have data that I read into a PCollection that is
> smaller than my system’s block size, but when processed in a DoFn, *grows*
> pretty exponentially.
>
> So what started as a 10 MB file might become 10 times larger.
>
> To prevent spills and memory issues, how could I leverage something like
> scaleFactor() (or whatever is needed) to indicate to the Crunch Planner
> that my resulting PCollection will grow exponentially?
> Can I tell it to leverage more mappers/reducers in the DoFn?
>
> Guidance, if you could!
>
> Thanks,
> Landon
