crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <jwi...@cloudera.com>
Subject Re: ability to specify a different function for combiner & reducer
Date Wed, 23 Oct 2013 12:55:09 GMT
On Wed, Oct 23, 2013 at 5:28 AM, Gabriel Reid <gabriel.reid@gmail.com>wrote:

> On Wed, Oct 23, 2013 at 2:10 PM, Josh Wills <josh.wills@gmail.com> wrote:
> > I certainly understand the issue; do you prefer the two-function solution
> > to one in which we added a method in DoFn to indicate which phase of the
> MR
> > job a particular DoFn was being executed in? We might have options like
> > MAP, REDUCE, COMBINE, or IN_MEMORY. (I'm not totally sure if such a
> > solution would work for all cases, so someone please call me out on that
> if
> > there's something I'm missing.)
> >
>
> Do you mean a method in DoFn so that you could say "execute this
> function in the reduce phase", or do you mean a method where you could
> say "tell me which phase I'm operating in"?
>

The latter.


>
> If you're talking about the second case, I think that Stefan's
> suggested solution will be perfectly compatible with that (i.e. pass
> in a single CombineFn that is aware of which context it's running in),
> while it would also be possible to explicitly pass in two different
> implementations, which I think would be really interesting to have.
>

Indeed-- my thought was that the functions would be pretty similar in most
cases, and that you would usually just want a switch for determining which
context you were in to reduce the code duplication. Of course, you could do
it yourself w/like a boolean flag on the constructor of the DoFn, which
might also be a good option-- wouldn't require a DoFn change.


>
> - Gabriel
>
>
>
>
> > J
> >
> >
> > On Wed, Oct 23, 2013 at 12:47 AM, Stefan De Smit <
> stefan.desmit@gmail.com>wrote:
> >
> >> Hi,
> >>
> >> I encountered a situation where I need different behaviour of my
> CombineFn
> >> during combine & reduce phase.
> >> Basically, I have a collection of avro records that I need to combine.
> >> For some of these, I have so many records with same key that I need to
> >> combine them first to make my job work (memory & timing constraints)
> >> For others, I can't combine them, because I need all records together.
> >> So, basically I would want to know in my function if it's combining or
> >> reducing.
> >> The only way to solve my problem in crunch right now seems to be to
> first
> >> split my collection in 2 different collections, combine them separately
> &
> >> union them again.
> >> But this give a lot of overhead for something that would be supported by
> >> native M/R.
> >>
> >> I looked in the code and it seems that crunch internally has a
> NodeContext
> >> object to indicate COMBINE or REDUCE, but this context is not
> accessible in
> >> the DoFn.
> >> As the (RT)Node object is an internal crunch object, it's also not a
> clean
> >> solution to expose the NodeContext.
> >> So, as a better solution, it would be possible to create a new method:
> >> combineValues(combineFn, reduceFn) on PGroupedTable. The existing
> >> combineValues(combineFn) is in that case just a convenience method for
> must
> >> use cases, where the combineFn & reduceFn is the same function.
> >> With this new method, I would be able to just create my combineFn twice
> &
> >> pass a boolean in the constructor to indicate if it's combine or reduce.
> >>
> >> I already made a patch to add this function, but as the procedure
> >> indicates to discuss the change first, I'll write this mail first to
> check
> >> what you think. (I also didn't test my patch yet, although all unit & IT
> >> still pass)
> >>
> >> Thanks
> >> Stefan
> >>
> >>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message