crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gabriel Reid <>
Subject Re: ability to specify a different function for combiner & reducer
Date Wed, 23 Oct 2013 12:28:41 GMT
On Wed, Oct 23, 2013 at 2:10 PM, Josh Wills <> wrote:
> I certainly understand the issue; do you prefer the two-function solution
> to one in which we added a method in DoFn to indicate which phase of the MR
> job a particular DoFn was being executed in? We might have options like
> MAP, REDUCE, COMBINE, or IN_MEMORY. (I'm not totally sure if such a
> solution would work for all cases, so someone please call me out on that if
> there's something I'm missing.)

Do you mean a method in DoFn so that you could say "execute this
function in the reduce phase", or do you mean a method where you could
say "tell me which phase I'm operating in"?

If you're talking about the second case, I think that Stefan's
suggested solution will be perfectly compatible with that (i.e. pass
in a single CombineFn that is aware of which context it's running in),
while it would also be possible to explicitly pass in two different
implementations, which I think would be really interesting to have.

- Gabriel

> J
> On Wed, Oct 23, 2013 at 12:47 AM, Stefan De Smit <>wrote:
>> Hi,
>> I encountered a situation where I need different behaviour of my CombineFn
>> during combine & reduce phase.
>> Basically, I have a collection of avro records that I need to combine.
>> For some of these, I have so many records with same key that I need to
>> combine them first to make my job work (memory & timing constraints)
>> For others, I can't combine them, because I need all records together.
>> So, basically I would want to know in my function if it's combining or
>> reducing.
>> The only way to solve my problem in crunch right now seems to be to first
>> split my collection in 2 different collections, combine them separately &
>> union them again.
>> But this give a lot of overhead for something that would be supported by
>> native M/R.
>> I looked in the code and it seems that crunch internally has a NodeContext
>> object to indicate COMBINE or REDUCE, but this context is not accessible in
>> the DoFn.
>> As the (RT)Node object is an internal crunch object, it's also not a clean
>> solution to expose the NodeContext.
>> So, as a better solution, it would be possible to create a new method:
>> combineValues(combineFn, reduceFn) on PGroupedTable. The existing
>> combineValues(combineFn) is in that case just a convenience method for must
>> use cases, where the combineFn & reduceFn is the same function.
>> With this new method, I would be able to just create my combineFn twice &
>> pass a boolean in the constructor to indicate if it's combine or reduce.
>> I already made a patch to add this function, but as the procedure
>> indicates to discuss the change first, I'll write this mail first to check
>> what you think. (I also didn't test my patch yet, although all unit & IT
>> still pass)
>> Thanks
>> Stefan

View raw message