crunch-dev mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: ability to specify a different function for combiner & reducer
Date Thu, 24 Oct 2013 13:48:48 GMT
Stefan, could you put this in Jira
(https://issues.apache.org/jira/browse/CRUNCH) and add the patch
there?

- Gabriel

On Wed, Oct 23, 2013 at 2:55 PM, Josh Wills <jwills@cloudera.com> wrote:
> On Wed, Oct 23, 2013 at 5:28 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>
>> On Wed, Oct 23, 2013 at 2:10 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> > I certainly understand the issue; do you prefer the two-function solution
>> > to one in which we added a method in DoFn to indicate which phase of the
>> > MR job a particular DoFn was being executed in? We might have options like
>> > MAP, REDUCE, COMBINE, or IN_MEMORY. (I'm not totally sure if such a
>> > solution would work for all cases, so someone please call me out on that
>> > if there's something I'm missing.)
>> >
>>
>> Do you mean a method in DoFn so that you could say "execute this
>> function in the reduce phase", or do you mean a method where you could
>> say "tell me which phase I'm operating in"?
>>
>
> The latter.
>
>
>>
>> If you're talking about the second case, I think that Stefan's
>> suggested solution will be perfectly compatible with that (i.e. pass
>> in a single CombineFn that is aware of which context it's running in),
>> while it would also be possible to explicitly pass in two different
>> implementations, which I think would be really interesting to have.
>>
>
> Indeed-- my thought was that the functions would be pretty similar in most
> cases, and that you would usually just want a switch for determining which
> context you were in to reduce the code duplication. Of course, you could do
> it yourself w/like a boolean flag on the constructor of the DoFn, which
> might also be a good option-- wouldn't require a DoFn change.
>
>
>>
>> - Gabriel
>>
>>
>>
>>
>> > J
>> >
>> >
>> > On Wed, Oct 23, 2013 at 12:47 AM, Stefan De Smit <stefan.desmit@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> I encountered a situation where I need different behaviour from my
>> >> CombineFn during the combine & reduce phases.
>> >> Basically, I have a collection of avro records that I need to combine.
>> >> For some of these, I have so many records with the same key that I need
>> >> to combine them first to make my job work (memory & timing constraints).
>> >> For others, I can't combine them, because I need all records together.
>> >> So, basically I would want to know in my function if it's combining or
>> >> reducing.
>> >> The only way to solve my problem in crunch right now seems to be to
>> >> first split my collection into 2 different collections, combine them
>> >> separately & union them again.
>> >> But this gives a lot of overhead for something that would be supported
>> >> by native M/R.
>> >>
>> >> I looked in the code and it seems that crunch internally has a
>> >> NodeContext object to indicate COMBINE or REDUCE, but this context is
>> >> not accessible in the DoFn.
>> >> As the (RT)Node object is an internal crunch object, it's also not a
>> >> clean solution to expose the NodeContext.
>> >> So, as a better solution, it would be possible to create a new method:
>> >> combineValues(combineFn, reduceFn) on PGroupedTable. The existing
>> >> combineValues(combineFn) is in that case just a convenience method for
>> >> most use cases, where the combineFn & reduceFn are the same function.
>> >> With this new method, I would be able to just create my combineFn twice
>> >> & pass a boolean in the constructor to indicate if it's combine or reduce.
>> >>
>> >> I already made a patch to add this function, but as the procedure
>> >> indicates to discuss the change first, I'll write this mail first to
>> >> check what you think. (I also didn't test my patch yet, although all
>> >> unit tests & ITs still pass)
>> >>
>> >> Thanks
>> >> Stefan
>> >>
>> >>
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
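
The idea discussed in the thread above (one function class, instantiated twice, with a boolean constructor flag telling it whether it runs as combiner or reducer) can be sketched in plain Java. This is only an illustration under stated assumptions: ValueCombiner, SumCombiner, the "raw:" key prefix, and the local combineValues helper are all hypothetical stand-ins and are not the Crunch CombineFn or PGroupedTable API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PhaseAwareCombineSketch {

    /** Hypothetical stand-in for a combine function; NOT the Crunch CombineFn API. */
    interface ValueCombiner {
        List<Long> apply(String key, List<Long> values);
    }

    /** Sums values, but in the combine phase leaves some keys uncombined. */
    static final class SumCombiner implements ValueCombiner {
        private final boolean reducePhase;

        SumCombiner(boolean reducePhase) {
            this.reducePhase = reducePhase;
        }

        @Override
        public List<Long> apply(String key, List<Long> values) {
            // Assumed rule for illustration: keys starting with "raw:" need
            // every record to reach the reducer, so the combiner passes them on.
            if (!reducePhase && key.startsWith("raw:")) {
                return new ArrayList<>(values);
            }
            long sum = 0;
            for (long v : values) {
                sum += v;
            }
            List<Long> out = new ArrayList<>();
            out.add(sum);
            return out;
        }
    }

    /**
     * Hypothetical local helper that mimics the shape of the proposed
     * combineValues(combineFn, reduceFn): combineFn runs once per map-side
     * chunk, reduceFn runs once over everything that was shuffled.
     */
    static List<Long> combineValues(String key, List<List<Long>> mapSideChunks,
                                    ValueCombiner combineFn, ValueCombiner reduceFn) {
        List<Long> shuffled = new ArrayList<>();
        for (List<Long> chunk : mapSideChunks) {
            shuffled.addAll(combineFn.apply(key, chunk));
        }
        return reduceFn.apply(key, shuffled);
    }

    public static void main(String[] args) {
        ValueCombiner combineFn = new SumCombiner(false);
        ValueCombiner reduceFn = new SumCombiner(true);
        List<List<Long>> chunks = Arrays.asList(Arrays.asList(1L, 2L), Arrays.asList(3L));

        // What the combiner emits per chunk differs by key type.
        System.out.println(combineFn.apply("hot:a", chunks.get(0)));
        System.out.println(combineFn.apply("raw:a", chunks.get(0)));
        // The reducer sees either pre-summed or untouched values.
        System.out.println(combineValues("hot:a", chunks, combineFn, reduceFn));
        System.out.println(combineValues("raw:a", chunks, combineFn, reduceFn));
    }
}
```

Passing the same instance for both arguments would correspond to the existing single-function case described in the thread; passing two differently flagged instances avoids any change to DoFn itself.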
