crunch-user mailing list archives

From David Ortiz
Subject Re: Force new Map phase
Date Mon, 27 Jul 2015 22:45:45 GMT
I'll give that a try in the morning.  Thanks.

On Mon, Jul 27, 2015, 6:02 PM Josh Wills wrote:

> Hey David,
> The easiest way is to insert a PCollection.cache() call at the stage
> between the two joins where you think the reduce phase should end and the
> next map phase should begin. When the Crunch planner makes the decision of
> where to split the work between a reducer/mapper, it tries to respect any
> explicit cache() calls that it encounters.
> Josh
> On Mon, Jul 27, 2015 at 2:58 PM, David Ortiz wrote:
>>  Hey,
>>      Are there any easy tricks to force a new map stage to kick off?  I
>> know I can force a reduce with GBK operations, but one of our jobs is
>> running into trouble with data skew.  From what I can tell, we are getting
>> a couple of hot keys that join properly, but the follow-up processing that
>> comes before the next join causes the reducer to hit the GC overhead limit.
>> Based on the dot file, the planner is doing all the preprocessing for the
>> next join in the reducer of the first join.  It could easily do that work
>> in the map phase before the next join without any issues, and I think that
>> would also get past the memory problem.  The only solution I could think of
>> at the moment is to do everything up to the first join, call
>> pipeline.done(), then add the remaining operations before another
>> pipeline.done() call.
>> Thanks,
>>     Dave
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
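
For readers of the archive, the cache() suggestion above might look roughly like the sketch below. It is only an illustration, not code from the thread: the class name, the tab-separated input format, the three input paths, and the trivial "preprocessing" step are all assumptions. It needs Apache Crunch and Hadoop on the classpath. As Josh notes, the planner only tries to respect cache(), so it is worth checking the dot file again afterwards.

```java
import org.apache.crunch.MapFn;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Join;
import org.apache.crunch.types.PTableType;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class CacheBetweenJoins {

  // Hypothetical parser: turns "key<TAB>value" lines into (key, value) pairs.
  static class ParseLine extends MapFn<String, Pair<String, String>> {
    @Override
    public Pair<String, String> map(String line) {
      String[] parts = line.split("\t", 2);
      return Pair.of(parts[0], parts.length > 1 ? parts[1] : "");
    }
  }

  // Hypothetical stand-in for the preprocessing that was overloading the reducer.
  static class Preprocess
      extends MapFn<Pair<String, Pair<String, String>>, Pair<String, String>> {
    @Override
    public Pair<String, String> map(Pair<String, Pair<String, String>> in) {
      return Pair.of(in.first(), in.second().first() + "," + in.second().second());
    }
  }

  public static void main(String[] args) {
    // args: three input paths and one output path (assumed layout).
    Pipeline pipeline = new MRPipeline(CacheBetweenJoins.class, new Configuration());
    PTableType<String, String> strings =
        Writables.tableOf(Writables.strings(), Writables.strings());

    PTable<String, String> a = pipeline.readTextFile(args[0]).parallelDo(new ParseLine(), strings);
    PTable<String, String> b = pipeline.readTextFile(args[1]).parallelDo(new ParseLine(), strings);
    PTable<String, String> c = pipeline.readTextFile(args[2]).parallelDo(new ParseLine(), strings);

    // First join. The cache() call marks the point where the first job's
    // reduce phase should end and its output be materialized.
    PTable<String, Pair<String, String>> firstJoin = Join.join(a, b).cache();

    // Work derived from the cached table should now be planned into the
    // map phase of the following job, if the planner honors the hint.
    PTable<String, String> prepared = firstJoin.parallelDo(new Preprocess(), strings);

    // Second join, fed by the preprocessed output.
    PTable<String, Pair<String, String>> secondJoin = Join.join(prepared, c);

    pipeline.writeTextFile(secondJoin, args[3]);
    pipeline.done();
  }
}
```

Compared with calling pipeline.done() twice, this keeps everything in a single pipeline and leaves the intermediate output for Crunch to manage.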
