From: Josh Wills
Date: Wed, 25 Sep 2013 05:36:40 -0700
Subject: Re: Ability to specify a combiner (with different signature than reducer)
To: user@crunch.apache.org

FWIW, what I usually do in these situations (and they seem to come up a lot for machine learning projects) is use a combiner with a post-processing reducer that has a different signature. Chao's case is a little different because the DoFn needs to know whether it's in the combiner or the reducer context, but the Crunch framework knows this via the NodeContext, so there must be a way to communicate this to the CombineFn. If there isn't, we should make a change to expose it.

For this example, the output of both my Combiner and my Reducer would be a Collection<Integer>. If I was in the combiner case, I would emit just a single Integer to that collection (the max from that combiner), and if I was in the reducer context, I would emit the entire Iterable<Integer> as a Collection<Integer>. Then I would have a post-processing MapFn that would take the values from the Collection<Integer> and join them to a string.

On Wed, Sep 25, 2013 at 2:58 AM, Chao Shi wrote:

> Yes. It was a typo. I mean PTable#combineValues.
>
> 2013/9/25 Gabriel Reid
>
>> Hi Chao,
>>
>>> Your approach is tricky. I agree that this kind of MR logic is pretty
>>> common, so it would be nice to add such a feature to Crunch. At first
>>> glance, I think the problem in PTable#collectValues is that it returns a
>>> PTable rather than a PGroupedTable (I haven't checked the internal logic yet).
>>
>> I think that PTable#collectValues is for a different kind of use case --
>> internally it just does a groupByKey and then puts all the values in a
>> single collection for each key, so I'm not sure how it would apply here. Or
>> did you mean the combineValues method?
>>
>> - Gabriel
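
The combiner/reducer scheme described in the message can be sketched in plain Java, without any Crunch classes. This is only an illustration under stated assumptions: the `Stage` enum is a hypothetical stand-in for whatever Crunch would expose from its NodeContext, the names `process`, `join`, and `wrap` are invented for the example, and the comma separator is an arbitrary choice. Both stages take and return a Collection<Integer>, so the same function could serve as combiner and reducer, and a separate step joins the values into a string.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

final class CombinerSketch {
  // Hypothetical stand-in for the stage information the framework tracks.
  enum Stage { COMBINE, REDUCE }

  // Same input and output types in both stages, as a combiner requires.
  static Collection<Integer> process(Stage stage, Iterable<Collection<Integer>> values) {
    List<Integer> all = new ArrayList<>();
    for (Collection<Integer> c : values) {
      all.addAll(c);
    }
    if (stage == Stage.COMBINE && !all.isEmpty()) {
      // Combiner: emit only the max seen by this combiner run.
      return Collections.singletonList(Collections.max(all));
    }
    // Reducer: emit everything that reached it as one collection.
    return all;
  }

  // Post-processing step (the role of the MapFn): join values into a string.
  static String join(Collection<Integer> values) {
    return values.stream().map(String::valueOf).collect(Collectors.joining(","));
  }

  // Helper for the example: wrap each raw value in a single-element collection.
  static List<Collection<Integer>> wrap(int... xs) {
    List<Collection<Integer>> out = new ArrayList<>();
    for (int x : xs) {
      out.add(Collections.singletonList(x));
    }
    return out;
  }

  public static void main(String[] args) {
    // Two map-side splits for the same key, each combined separately.
    List<Collection<Integer>> partials = new ArrayList<>();
    partials.add(process(Stage.COMBINE, wrap(3, 9, 4)));
    partials.add(process(Stage.COMBINE, wrap(7, 2)));
    // The reducer sees one max per combiner run; the join step formats them.
    System.out.println(join(process(Stage.REDUCE, partials))); // prints 9,7
  }
}
```

Since a framework may run the combiner zero or more times, the reducer branch in this sketch accepts collections of any size rather than assuming one value per input.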