crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Ability to specify a combiner (with different signature than reducer)
Date Wed, 25 Sep 2013 14:07:43 GMT
Hey Chao/Gabriel,

You two seem to be agreeing, which makes me think I misread Chao's initial
problem specification. :) In any case, it seems like the PTable<K,
Collection<V>> approach will do what you want here, which makes me happy.

J


On Wed, Sep 25, 2013 at 6:32 AM, Chao Shi <stepinto@live.com> wrote:

> Hi Josh,
>
> I don't quite understand your second paragraph. Did you mean Gabriel's
> approach? Since a reducer reads the output of a combiner, the reducer must
> read PType<String, Collection<Integer>>. In fact, with this approach, I
> don't think the CombineFn needs to tell whether it is run in combiner or
> reducer context: it simply emits the top K values. If there is not much
> overhead in using the singleton collection, I think this approach would fit
> Crunch's model perfectly.
>
>
> 2013/9/25 Josh Wills <josh.wills@gmail.com>
>
>> FWIW, what I usually do in these situations (and they seem to come up a
>> lot for machine learning projects) is use a combiner with a post-processing
>> reducer that has a different signature. Chao's case is a little different
>> because the DoFn needs to know whether it's in the combiner or the reducer
>> contexts, but the Crunch framework knows this via the NodeContext, so there
>> must be a way to communicate this to the CombineFn. If there isn't, we
>> should make a change to expose it.
>>
>> For this example, the output of both my Combiner and my Reducer would be
>> a Collection<Integer>. If I was in the combiner context, I would emit just
>> a single Integer to that collection (the max from that combiner), and if I
>> was in the reducer context, I would emit the entire Iterable<Integer> as a
>> Collection<Integer>. Then I would have a post-processing MapFn that would
>> take the values from the Collection<Integer> and join them to a string.
>>
>>
>> On Wed, Sep 25, 2013 at 2:58 AM, Chao Shi <stepinto@live.com> wrote:
>>
>>> Yes. It was a typo. I mean PTable#combineValues.
>>>
>>>
>>> 2013/9/25 Gabriel Reid <gabriel.reid@gmail.com>
>>>
>>>> Hi Chao,
>>>>
>>>>
>>>>> Your approach is tricky. I agree that this kind of MR logic is pretty
>>>>> common, so it would be nice to add such a feature to Crunch. At first
>>>>> glance, I think the problem with PTable#collectValues is that it returns
>>>>> a PTable rather than a PGroupedTable (I haven't checked the internal
>>>>> logic yet).
>>>>>
>>>>>
>>>> I think that PTable#collectValues is for a different kind of use case
>>>> -- internally it just does a groupByKey and then puts all the values in a
>>>> single collection for each key, so I'm not sure how it would apply here.
>>>> Or did you mean the combineValues method?
>>>>
>>>> - Gabriel
>>>>
>>>
>>>
>>
>
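
The Collection-valued approach the thread converges on can be sketched in plain Java, without the Crunch API. This is only an illustration under stated assumptions, not code from anyone in the thread: the class and method names (TopKSketch, singletons, topK, join) and K = 2 are hypothetical. Each mapped value is wrapped in a singleton collection, and the same merge function runs in both the combiner and the reducer stage, so it never needs to know which context it is in. A final step joins the surviving values into a string.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collectors;

public class TopKSketch {

    // Map side (hypothetical helper): wrap every value in a singleton collection
    // so that combiner input and output share the type Collection<Integer>.
    public static List<Collection<Integer>> singletons(int... values) {
        List<Collection<Integer>> out = new ArrayList<>();
        for (int v : values) {
            out.add(Collections.singletonList(v));
        }
        return out;
    }

    // Stand-in for the body of a combine function: merge partial collections
    // and keep only the k largest values, returned in descending order.
    public static Collection<Integer> topK(Iterable<? extends Collection<Integer>> partials, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (Collection<Integer> partial : partials) {
            for (Integer v : partial) {
                heap.add(v);
                if (heap.size() > k) {
                    heap.poll(); // drop the smallest value seen so far
                }
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    // Stand-in for the post-processing step: join the values into a string.
    public static String join(Collection<Integer> values) {
        return values.stream().map(String::valueOf).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        int k = 2; // assumed K for illustration

        // "Combiner" stage: each input split is reduced independently.
        Collection<Integer> fromSplit1 = topK(singletons(5, 1, 9), k);
        Collection<Integer> fromSplit2 = topK(singletons(7, 3), k);

        // "Reducer" stage: the same function merges the combiner outputs.
        List<Collection<Integer>> combined = new ArrayList<>();
        combined.add(fromSplit1);
        combined.add(fromSplit2);
        Collection<Integer> finalTopK = topK(combined, k);

        System.out.println(join(finalTopK)); // prints "9,7"
    }
}
```

Because the merge is applied to collections at every stage, running the combiner zero, one, or several times gives the same final result, which is the property a combiner needs.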
