crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <>
Subject Re: ability to specify a different function for combiner & reducer
Date Wed, 23 Oct 2013 12:10:00 GMT
I certainly understand the issue; do you prefer the two-function solution
to one in which we added a method in DoFn to indicate which phase of the MR
job a particular DoFn was being executed in? We might have options like
MAP, REDUCE, COMBINE, or IN_MEMORY. (I'm not totally sure if such a
solution would work for all cases, so someone please call me out on that if
there's something I'm missing.)


On Wed, Oct 23, 2013 at 12:47 AM, Stefan De Smit <>wrote:

> Hi,
> I encountered a situation where I need different behaviour of my CombineFn
> during combine & reduce phase.
> Basically, I have a collection of avro records that I need to combine.
> For some of these, I have so many records with same key that I need to
> combine them first to make my job work (memory & timing constraints)
> For others, I can't combine them, because I need all records together.
> So, basically I would want to know in my function if it's combining or
> reducing.
> The only way to solve my problem in crunch right now seems to be to first
> split my collection in 2 different collections, combine them separately &
> union them again.
> But this give a lot of overhead for something that would be supported by
> native M/R.
> I looked in the code and it seems that crunch internally has a NodeContext
> object to indicate COMBINE or REDUCE, but this context is not accessible in
> the DoFn.
> As the (RT)Node object is an internal crunch object, it's also not a clean
> solution to expose the NodeContext.
> So, as a better solution, it would be possible to create a new method:
> combineValues(combineFn, reduceFn) on PGroupedTable. The existing
> combineValues(combineFn) is in that case just a convenience method for must
> use cases, where the combineFn & reduceFn is the same function.
> With this new method, I would be able to just create my combineFn twice &
> pass a boolean in the constructor to indicate if it's combine or reduce.
> I already made a patch to add this function, but as the procedure
> indicates to discuss the change first, I'll write this mail first to check
> what you think. (I also didn't test my patch yet, although all unit & IT
> still pass)
> Thanks
> Stefan

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message