crunch-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jeff Quinn <j...@nuna.com>
Subject Re: Understanding ScaleFactor in Crunch
Date Tue, 03 Nov 2015 19:04:51 GMT
Just wanted to add a few things I have learned about scale factors. Please
anyone correct me if I am wrong, this is just my intuitive understanding.

When you have an input Source, crunch will measure the number of bytes for
that input source (using HDFS APIs, essentially like running `hadoop fs
-ls` and adding up the size of all the input files. Then each DoFn that is
applied over that input Source has its own scale factor, which are
multiplied against the original Source size. Finally when crunch gets to a
GBK stage, it will use this derived size estimate, with the
`crunch.bytes.per.reduce.task` and `crunch.max.reducers` config params to
determine the number of reducers. That calculation happens here in the code:

https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/util/PartitionUtils.java

So for example if you had a pipeline like this:

(Input) [10Bytes]  --> (DoFn1)[Scale Factor: 2.0] --> (DoFn2)[Scale Factor:
3.0] --> (GBK)

and crunch.bytes.per.reduce.task: 10 Bytes and crunch.max.reducers: 5

Crunch would estimate 10 Bytes * 2.0 * 3.0 = 60 Bytes of data at the GBK
vertex. It would divide that by crunch.bytes.per.reduce.task which gives us
60 Bytes / 10 Bytes = 6 Reduce Tasks. However 6 > crunch.max.reducers, so
it would launch an MR job with just 5 Reduce tasks, which is the value of
crunch.max.reducers


On Tue, Nov 3, 2015 at 8:54 AM, Micah Whitacre <mkwhitacre@gmail.com> wrote:

> There are some details on scaleFactor here.[1]  Essentially crunch would
> use it for a couple of options:
>
> 1. Calculating the number of reducers to use when grouping and nothing is
> specified
> 2. Optimizing to decrease how much I/O it has to do if possible.
>
> In the last situation if your pipeline might will require state to persist
> Crunch will try to optimize to do the least amount of I/O at the cost of
> doing recalculations.  So it might persist to disk right before your DoFn
> if it creates a significant amount more of data.  If you know the data
> increase is that significant it would definitely be advisable to override
> the method and give a more reasonable factor value.
>
> >> Can I tell it to leverage more mappers/reducers in the DoFn?
>
> Scale factors will be applicable for the number of reducers but shouldn't
> affect mappers as that would be controlled by the input splits.
>
> [1] - http://crunch.apache.org/user-guide.html#doplan
> [2] -
> https://github.com/apache/crunch/blob/188360048a7f2d3cedf5fc915b48d7671f1d8d46/crunch-core/src/main/java/org/apache/crunch/DoFn.java#L132
>
> On Tue, Nov 3, 2015 at 10:38 AM, Robinson, Landon - Landon <
> landon.t.robinson@lowes.com> wrote:
>
>> All,
>>
>> I’m trying to understand how I might use scaleFactor() in my Crunch code.
>> My use case is this: I have data that I read into a Pcollection that is
>> smaller than my system’s block size, but when processed in a DoFn,
>> *grows* pretty exponentially.
>>
>> So what started as a 10mb file might become 10 times larger.
>>
>> To prevent spills and memory issues, how could I leverage something like
>> scaleFactor() (or whatever is needed) to indicate to the Crunch Planner
>> that my resulting Pcollection will grow exponentially?
>> Can I tell it to leverage more mappers/reducers in the DoFn?
>>
>> Guidance, if you could!
>>
>> Thanks,
>> Landon
>>
>> ---------------------------------------------------------------------------
>>
>> Landon Robinson
>>
>> ---------------------------------------------------------------------------
>> NOTICE: All information in and attached to the e-mails below may be
>> proprietary, confidential, privileged and otherwise protected from improper
>> or erroneous disclosure. If you are not the sender's intended recipient,
>> you are not authorized to intercept, read, print, retain, copy, forward, or
>> disseminate this message. If you have erroneously received this
>> communication, please notify the sender immediately by phone
>> (704-758-1000) or by e-mail and destroy all copies of this message
>> electronic, paper, or otherwise.
>>
>> *By transmitting documents via this email: Users, Customers, Suppliers
>> and Vendors collectively acknowledge and agree the transmittal of
>> information via email is voluntary, is offered as a convenience, and is not
>> a secured method of communication; Not to transmit any payment information
>> E.G. credit card, debit card, checking account, wire transfer information,
>> passwords, or sensitive and personal information E.G. Driver's license,
>> DOB, social security, or any other information the user wishes to remain
>> confidential; To transmit only non-confidential information such as plans,
>> pictures and drawings and to assume all risk and liability for and
>> indemnify Lowe's from any claims, losses or damages that may arise from the
>> transmittal of documents or including non-confidential information in the
>> body of an email transmittal. Thank you. *
>>
>
>

-- 
*DISCLAIMER:* The contents of this email, including any attachments, may 
contain information that is confidential, proprietary in nature, protected 
health information (PHI), or otherwise protected by law from disclosure, 
and is solely for the use of the intended recipient(s). If you are not the 
intended recipient, you are hereby notified that any use, disclosure or 
copying of this email, including any attachments, is unauthorized and 
strictly prohibited. If you have received this email in error, please 
notify the sender of this email. Please delete this and all copies of this 
email from your system. Any opinions either expressed or implied in this 
email and all attachments, are those of its author only, and do not 
necessarily reflect those of Nuna Health, Inc.

Mime
View raw message