crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stefan De Smit <stefan.des...@gmail.com>
Subject ability to specify a different function for combiner & reducer
Date Wed, 23 Oct 2013 07:47:24 GMT
Hi,

I encountered a situation where I need different behaviour of my CombineFn
during combine & reduce phase.
Basically, I have a collection of avro records that I need to combine.
For some of these, I have so many records with same key that I need to
combine them first to make my job work (memory & timing constraints)
For others, I can't combine them, because I need all records together.
So, basically I would want to know in my function if it's combining or
reducing.
The only way to solve my problem in crunch right now seems to be to first
split my collection in 2 different collections, combine them separately &
union them again.
But this give a lot of overhead for something that would be supported by
native M/R.

I looked in the code and it seems that crunch internally has a NodeContext
object to indicate COMBINE or REDUCE, but this context is not accessible in
the DoFn.
As the (RT)Node object is an internal crunch object, it's also not a clean
solution to expose the NodeContext.
So, as a better solution, it would be possible to create a new method:
combineValues(combineFn, reduceFn) on PGroupedTable. The existing
combineValues(combineFn) is in that case just a convenience method for must
use cases, where the combineFn & reduceFn is the same function.
With this new method, I would be able to just create my combineFn twice &
pass a boolean in the constructor to indicate if it's combine or reduce.

I already made a patch to add this function, but as the procedure indicates
to discuss the change first, I'll write this mail first to check what you
think. (I also didn't test my patch yet, although all unit & IT still pass)

Thanks
Stefan

Mime
View raw message