crunch-user mailing list archives

From Josh Wills <>
Subject Re: Same processing in two m/r jobs
Date Thu, 14 Nov 2013 23:20:31 GMT
Closing the loop on this one-- I'm fixing the underlying issue in:

On Tue, Nov 12, 2013 at 10:04 AM, Josh Wills <> wrote:

> I'm surprised that it's still writing out S1-- are you changing the
> downstream operations (PTable.keys and GBK) to read from the union of S2
> and S3? If so, I'd like to try to recreate that; it sounds like a planner
> bug.
> Whenever there is a dependency between two GBK operations, the planner
> analyzes the operations that link those two GBKs and tries to find a "good"
> place to split the pipeline into two separate MR jobs. "Good" is usually
> based on the planner's rough estimate of how much data each DoFn will
> write, which is largely determined by the value of each DoFn's float
> scaleFactor() function: scaleFactor() > 1.0 means the DoFn is expected to
> write out more data than it reads in; scaleFactor() < 1.0 means it is
> expected to write less.
> The only exception to that rule is when you are already writing a readable
> (i.e., read/write) output file at some point along the chain of operations
> between the two GBKs, in which case the planner simply chooses that file as
> the split point. Note that text outputs are write-only: Crunch does not
> assume it can read a text file back in as it was written unless it is a
> PCollection<String>.
> J
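Josh's split-point heuristic can be sketched as a toy cost model. This is not Crunch's planner code; the input sizes, the scale factors, and the `bestSplit` helper below are all hypothetical, standing in only for the idea that the planner multiplies cumulative scaleFactor() values to estimate intermediate data sizes and splits where the estimate is smallest.

```java
// Toy model of the split-point choice described above -- NOT Crunch's
// actual planner.  Each DoFn in the chain between the two GBKs has a
// scaleFactor(); the estimated data flowing out of DoFn i is the input
// size times the product of scaleFactors 0..i, and the cheapest place to
// split the pipeline is wherever that estimate is smallest.
public class SplitModel {

    /** Index of the DoFn after which the estimated data size is smallest. */
    static int bestSplit(double inputSize, double[] scaleFactors) {
        int best = 0;
        double bestSize = Double.MAX_VALUE;
        double size = inputSize;
        for (int i = 0; i < scaleFactors.length; i++) {
            size *= scaleFactors[i];      // estimated output of DoFn i
            if (size < bestSize) {
                bestSize = size;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical chain: a filter (0.5), a flatMap that explodes
        // records (3.0), and an aggressive projection (0.2).  Estimated
        // sizes after each DoFn: 50, 150, 30 -> split after the third DoFn.
        System.out.println(bestSplit(100.0, new double[] {0.5, 3.0, 0.2}));
    }
}
```

In real Crunch code the lever you control is overriding scaleFactor() in your own DoFn subclasses so that the planner's size estimates better match reality.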
> On Tue, Nov 12, 2013 at 9:31 AM, Mungre, Surbhi <> wrote:
>>  Hey Josh,
>> Thanks for the reply! I think we will be able to get around this issue
>> by materializing the output of union of S2 and S3.
>> However, the DAG shows that the first job is still writing the output of
>> S1 to disk. Out of curiosity, how does the planner decide to write the
>> output of S1 to disk instead of the output of the union of S2 and S3?
>>  -Surbhi
>>   From: Josh Wills <>
>> Reply-To: "" <>
>> Date: Tuesday, November 12, 2013 10:26 AM
>> To: "" <>
>> Subject: Re: Same processing in two m/r jobs
>>   Hey Surbhi,
>>  The planner is trying to minimize the amount of data it writes to disk
>> at the end of the first job; it doesn't usually worry so much about
>> re-running the same computation in two different jobs if it means that less
>> data will be written to disk overall, since most MR jobs aren't CPU bound.
>> While that's often a useful heuristic, there are many cases where it
>> isn't true, and this sounds like one of them. My advice would be to
>> materialize the output of the union of S2 and S3, at which point the
>> planner should run the processing of S2 and S3 once at the end of job 1,
>> and then pick up that materialized output for grouping in job 2.
>>  Best,
>> Josh
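Why materializing helps can be modeled with a memoized supplier: two downstream consumers (the context-info job and the HFile job) both pull the processed union, and the memoization stands in for the materialized file that job 2 re-reads instead of recomputing. A toy sketch, not Crunch code; all names here are hypothetical:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Toy model of the advice above -- NOT Crunch code.  The processing of the
// union of S2 and S3 is a Supplier; materialize() memoizes it the way a
// materialized PCollection is computed once and then re-read from disk.
public class MaterializeModel {
    static final AtomicInteger unionRuns = new AtomicInteger();

    /** Stand-in for the (expensive) processing chain over S2 union S3. */
    static Supplier<String> processUnionOfS2AndS3() {
        return () -> {
            unionRuns.incrementAndGet();  // counts how often the chain runs
            return "processed s2+s3";     // stand-in for the output records
        };
    }

    /** Runs the wrapped computation at most once and caches the result. */
    static <T> Supplier<T> materialize(Supplier<T> s) {
        return new Supplier<T>() {
            private T cached;
            public T get() {
                if (cached == null) {
                    cached = s.get();
                }
                return cached;
            }
        };
    }

    public static void main(String[] args) {
        Supplier<String> union = materialize(processUnionOfS2AndS3());
        union.get();  // "job 1": write the context information
        union.get();  // "job 2": group and write the HFiles
        System.out.println(unionRuns.get());  // the chain ran only once
    }
}
```

Without the `materialize` wrapper, both `get()` calls would re-run the chain, which mirrors the duplicated S2/S3 processing in the original DAG.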
>> On Mon, Nov 11, 2013 at 8:30 PM, Mungre, Surbhi <> wrote:
>>>  Background:
>>> We have a crunch pipeline which is used to normalize and standardize
>>> some entities represented as Avro. In our pipeline, we also capture some
>>> context information about the errors and warnings which we encounter during
>>> our processing. We pass a pair of context information and Avro entities in
>>> our pipeline. At the end of the pipeline, the context information is
>>> written to HDFS and Avro entities are written to HFiles.
>>> Problem:
>>> When we were trying to analyze the DAG for our Crunch pipeline, we
>>> noticed that the same processing is done in two m/r jobs: once to capture
>>> context information and a second time to generate HFiles. I wrote a test
>>> which replicates this issue with a simple example. The test and a DAG
>>> created from it are attached to this post. It is clear from the DAG that
>>> S2 and S3 are processed twice. I am not sure why this processing is done
>>> twice or whether there is any way to avoid this behavior.
>>> Surbhi Mungre
>>> Software Engineer

Director of Data Science
Cloudera <>
Twitter: @josh_wills <>
