crunch-user mailing list archives

From Josh Wills
Subject Re: Force new Map phase
Date Mon, 27 Jul 2015 22:01:54 GMT
Hey David,

The easiest way is to insert a PCollection.cache() call at the point
between the two joins where you think the reduce phase should end and the
next map phase should begin. When the Crunch planner decides where to split
the work between a reducer and the following mapper, it tries to respect any
explicit cache() calls that it encounters.
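
A minimal sketch of that placement, assuming three String-keyed tables and a trivial stand-in for the processing that sits between the joins. The class name ForceMapPhase, the Merge function, and the table names are hypothetical; Join.join, PTable.mapValues, and PTable.cache are the regular Crunch calls. How the planner finally splits the job still depends on the rest of the pipeline, so it is worth checking the dot file afterwards.

```java
import org.apache.crunch.MapFn;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.lib.Join;
import org.apache.crunch.types.writable.Writables;

public class ForceMapPhase {

  /** Hypothetical stand-in for the processing that runs between the two joins. */
  public static class Merge extends MapFn<Pair<String, String>, String> {
    @Override
    public String map(Pair<String, String> joined) {
      return joined.first() + "|" + joined.second();
    }
  }

  public static PTable<String, Pair<String, String>> twoJoins(
      PTable<String, String> a,
      PTable<String, String> b,
      PTable<String, String> c) {
    // First join: an inner join that needs a shuffle and a reduce phase.
    PTable<String, Pair<String, String>> firstJoin = Join.join(a, b);

    // Mark the join output with cache(). This is the hint that the reduce
    // phase should end here, so the work below is planned into the map
    // phase of the next job instead of the first join's reducer.
    PTable<String, Pair<String, String>> boundary = firstJoin.cache();

    // Processing that prepares the input of the second join.
    PTable<String, String> prepared =
        boundary.mapValues(new Merge(), Writables.strings());

    // Second join.
    return Join.join(prepared, c);
  }
}
```

The cache() call goes directly on the output of the first join, before the follow-up processing, because that is where the reduce phase is meant to stop.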


On Mon, Jul 27, 2015 at 2:58 PM, David Ortiz wrote:

>  Hey,
>      Are there any easy tricks to force a new map stage to kick off?  I
> know I can force a reduce with GBK operations, but one of our jobs is
> running into problems with data skew. From what I can tell, we are getting
> a couple of hot keys that join properly, but when the follow-up processing
> that comes before the next join runs, the reducer hits the GC Overhead
> Limit.  Based on the dot file, the job is trying to do all the
> preprocessing for the next join in the reducer of the first join, but it
> could easily do it in the map phase before the next join in the pipeline
> without any issues, and I think this would also get past the issue we’re
> having with memory.  The only solution I could think of at the moment is to
> do everything up to the first join, call pipeline.done(), then add some
> more operations before another pipeline.done() call.
> Thanks,
>     Dave

Director of Data Science
Cloudera
Twitter: @josh_wills
