crunch-user mailing list archives

From: David Ortiz
Subject: Re: Force new Map phase
Date: Mon, 27 Jul 2015 22:48:03 GMT
*in the morning

On Mon, Jul 27, 2015, 6:45 PM David Ortiz wrote:

> I'll give that a try on your morning.  Thanks.
> On Mon, Jul 27, 2015, 6:02 PM Josh Wills wrote:
>> Hey David,
>> The easiest way is to insert a PCollection.cache() call at the stage
>> between the two joins where you think the reduce phase should end and the
>> next map phase should begin. When the Crunch planner makes the decision of
>> where to split the work between a reducer/mapper, it tries to respect any
>> explicit cache() calls that it encounters.
>> Josh
>> On Mon, Jul 27, 2015 at 2:58 PM, David Ortiz wrote:
>>>  Hey,
>>>      Are there any easy tricks to force a new map stage to kick off?  I
>>> know I can force a reduce with GBK operations, but one of our jobs is
>>> running into data skew.  From what I can tell, a couple of hot keys join
>>> properly, but the follow-up processing that comes before the next join
>>> makes the reducer hit the GC Overhead Limit.  Based on the dot file, it
>>> is trying to do all the preprocessing for the next join in the reducer
>>> from the first join.  That work could easily run in the map phase before
>>> the next join in the pipeline, and I think that would also get us past
>>> the memory issue.  The only solution I could think of at the moment is to
>>> do everything up to the first join, call pipeline.done(), then add some
>>> more operations before another pipeline.done() call.
>>> Thanks,
>>>     Dave
>> --
>> Director of Data Science
>> Cloudera
>> Twitter: @josh_wills
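
For readers following the thread, here is a rough sketch of what Josh's cache() suggestion might look like in code. It is only an illustration: the class name ForceMapPhase, the Preprocess function, the joinTwice helper, and the String key/value types are all hypothetical stand-ins, not taken from David's job. As Josh notes, cache() is something the planner tries to respect, so the resulting plan should still be checked against the dot file.

```java
import org.apache.crunch.MapFn;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.lib.Join;
import org.apache.crunch.types.writable.Writables;

public class ForceMapPhase {

  /** Hypothetical stand-in for the memory-heavy processing between the two joins. */
  public static class Preprocess extends MapFn<Pair<String, String>, String> {
    @Override
    public String map(Pair<String, String> joined) {
      return joined.first() + "|" + joined.second();
    }
  }

  /** Hypothetical helper: two joins with an explicit cache() between them. */
  public static PTable<String, Pair<String, String>> joinTwice(
      PTable<String, String> a,
      PTable<String, String> b,
      PTable<String, String> c) {
    // First join. cache() marks this as the point where the reduce phase
    // should end; the planner tries to respect it when splitting the work.
    PTable<String, Pair<String, String>> first = Join.join(a, b).cache();

    // Preprocessing for the next join. With the cache() above, the intent is
    // that this runs in the map phase of the following job rather than in
    // the reducer of the first join.
    PTable<String, String> prepared =
        first.mapValues(new Preprocess(), Writables.strings());

    // Second join.
    return Join.join(prepared, c);
  }
}
```

Compared with calling pipeline.done() twice, this keeps everything in a single pipeline and leaves the job boundaries to the planner.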