crunch-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Ability to specify a combiner (with different signature than reducer)
Date Wed, 25 Sep 2013 13:23:49 GMT
On Wed, Sep 25, 2013 at 2:36 PM, Josh Wills <josh.wills@gmail.com> wrote:

> FWIW, what I usually do in these situations (and they seem to come up a
> lot for machine learning projects) is use a combiner with a post-processing
> reducer that has a different signature. Chao's case is a little different
> because the DoFn needs to know whether it's in the combiner or the reducer
> contexts, but the Crunch framework knows this via the NodeContext, so there
> must be a way to communicate this to the CombineFn. If there isn't, we
> should make a change to expose it.
>

That sounds like it would be pretty handy -- I remember someone else on the
list asking about a similar thing a few months ago as well.


>
> For this example, the output of both my Combiner and my Reducer would be a
> Collection<Integer>, and if I was in the combiner case, I would emit just a
> single Integer to that collection (the max from that combiner), and if I
> was in the reducer context, I would emit the entire Iterable<Integer> as a
> Collection<Integer>. Then I would have a post-processing MapFn that would
> take the values from the Collection<Integer> and join them to a string.
>

I think that's along the same kind of line that I was going with, but if
I'm understanding the issue correctly then there shouldn't even be a need
to know if you're in the reducer or combiner if you're working with
Collection<Integer>. I think that the combiner would be outputting the
top-k entries, and not just the top-1 entry, so both the combiner and the
reducer have the same logic, and can be the same class (although this
necessitates converting the PTable<K, V> to PTable<K, Collection<V>> at the
start).
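
A minimal sketch of that idea in plain Java, deliberately not tied to the Crunch API: the class name TopKMerge and its merge method are hypothetical, and in a real pipeline the same logic would sit inside whatever combine function is passed to combineValues. It only illustrates that when every value is a Collection<Integer>, one merge-and-truncate step can serve as both the combiner and the reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Crunch: merges partial top-k collections.
final class TopKMerge {
    private TopKMerge() {}

    // Merge any number of partial results and keep the k largest values.
    // The same call works on singleton collections (raw values wrapped at
    // the start) and on already-truncated top-k collections.
    static List<Integer> merge(Iterable<? extends Collection<Integer>> parts, int k) {
        List<Integer> all = new ArrayList<>();
        for (Collection<Integer> part : parts) {
            all.addAll(part);
        }
        all.sort(Collections.reverseOrder());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        int k = 2;
        // Each raw value is first wrapped in a singleton collection
        // (the V -> Collection<V> conversion mentioned above).
        List<List<Integer>> split1 =
            Arrays.asList(Arrays.asList(5), Arrays.asList(1), Arrays.asList(9));
        List<List<Integer>> split2 =
            Arrays.asList(Arrays.asList(7), Arrays.asList(3));

        // "Combiner" pass: one partial top-k per input split.
        List<Integer> partial1 = merge(split1, k);
        List<Integer> partial2 = merge(split2, k);

        // "Reducer" pass: exactly the same function over the partial results.
        List<Integer> result = merge(Arrays.asList(partial1, partial2), k);
        System.out.println(result); // prints [9, 7]
    }
}
```

Because truncating to the top k before merging never discards a value that belongs in the final top k, the function does not need to know which phase it is running in.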

- Gabriel


>
>
> On Wed, Sep 25, 2013 at 2:58 AM, Chao Shi <stepinto@live.com> wrote:
>
>> Yes. It was a typo. I mean PTable#combineValues.
>>
>>
>> 2013/9/25 Gabriel Reid <gabriel.reid@gmail.com>
>>
>>> Hi Chao,
>>>
>>>
>>>> Your approach is tricky. I agree that this kind of MR logic is pretty
>>>> common, so it would be nice to add such a feature to Crunch. At first
>>>> glance, I think the problem with PTable#collectValues is that it returns a
>>>> PTable rather than a PGroupedTable (I haven't checked the internal logic yet).
>>>>
>>>>
>>> I think that PTable#collectValues is for a different kind of use case --
>>> internally it just does a groupByKey and then puts all the values in a
>>> single collection for each key, so I'm not sure how it would apply here. Or
>>> did you mean the combineValues method?
>>>
>>> - Gabriel
>>>
>>
>>
>
