Return-Path: X-Original-To: apmail-crunch-dev-archive@www.apache.org Delivered-To: apmail-crunch-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id AB041106E3 for ; Thu, 17 Oct 2013 23:42:17 +0000 (UTC) Received: (qmail 75000 invoked by uid 500); 17 Oct 2013 23:42:17 -0000 Delivered-To: apmail-crunch-dev-archive@crunch.apache.org Received: (qmail 74967 invoked by uid 500); 17 Oct 2013 23:42:17 -0000 Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@crunch.apache.org Delivered-To: mailing list dev@crunch.apache.org Received: (qmail 74959 invoked by uid 99); 17 Oct 2013 23:42:17 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 17 Oct 2013 23:42:17 +0000 X-ASF-Spam-Status: No, hits=1.7 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of cbiswas1983@gmail.com designates 209.85.223.182 as permitted sender) Received: from [209.85.223.182] (HELO mail-ie0-f182.google.com) (209.85.223.182) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 17 Oct 2013 23:42:11 +0000 Received: by mail-ie0-f182.google.com with SMTP id as1so5321554iec.27 for ; Thu, 17 Oct 2013 16:41:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=0snpSFu99ZLj8BbVYRfb1Uz2jhKNStxgRm5uo/2Y1ow=; b=W0OzfSPFl8NU3+gRtYvLAq2fAm/PYR60L5W0ZGgBIvMo5seA7imPMy8CJlwO3KLzKb TNDk7TDc0HJldVYfXRg1FOS3QDMZ7WXYIllJLvMWmrLNFsYOP5EvRR0+S1ysGY8jDwuG B7oN9Eod2RoSsVNHfoXSHGzD9XUDUUHQDy2KCIGOlQpn2b0NxKy2upcV5uwkvJir5MHV zkQaGs+spG6pqG0m9cyqewJdW2GYHMouahaiNm5lB2xULxyDvq7jg7ag9xdDp7iktQKO /0j4nXlHhuZeg55gENkaVh9wVRAZx4biTqIrgHCJogooSsFuFQyEBEgtl+k1vB8fB7IC VHzQ== MIME-Version: 1.0 X-Received: by 10.50.178.234 with SMTP id db10mr493848igc.35.1382053310045; Thu, 17 Oct 2013 16:41:50 -0700 (PDT) Received: by 10.64.228.242 with HTTP; Thu, 17 Oct 2013 16:41:49 -0700 (PDT) In-Reply-To: References: Date: Thu, 17 Oct 2013 18:41:49 -0500 Message-ID: Subject: Re: Process of CombineFn returns ? From: Chandan Biswas To: dev@crunch.apache.org Content-Type: multipart/alternative; boundary=089e015387f09b281704e8f85c75 X-Virus-Checked: Checked by ClamAV on apache.org --089e015387f09b281704e8f85c75 Content-Type: text/plain; charset=ISO-8859-1 Yeah, I agree with Micah that it will not eliminate the reduce phase entirely. But the dummy object of U suggested by Josh (or converting to U type in map for every record) will not improve performance because same amounts of records will be sorted and aggregated in the reduce phase. But my point is, can we improve it by applying a combiner where the combineFn provides output as different type. If we have same type, we can use the combiner to do some aggregation in map side which improves performance. But, can we have some mechanism by which the same advantage can be achieved when combineFn emits different type. I think, emitting same type by CombineFn has restricted its use. Can we have new CombineFn that allows us to output different type not only same type as input? On Thu, Oct 17, 2013 at 5:05 PM, Josh Wills wrote: > Yeah, my experience in these kinds of situations is that you need to come > up with a "dummy" or singleton version of U for the case where there is > only a single T and do that conversion on the map side of the job, before > the combiner runs. I think Chao had an issue like this awhile ago, where he > had a PTable and wanted to write a combiner that would > return a PTable>. The solution was to convert > the map-side object to a PTable>, where the > value on the map-side was a singleton list containing just that double > value. Does that sort of trick work here? > > > On Thu, Oct 17, 2013 at 2:57 PM, Micah Whitacre wrote: > > > Ok so the feature you are trying to achieve is the proactive combination > of > > data before performing the GBK like the javadoc describes. Essentially > in > > that situation the CombineFn is being used as a Combiner[1] to combine > the > > data local to that mapper before doing the GBK and then further combining > > the data in the reduce operation. It will not necessarily eliminate the > > need for all processing in the reduce. > > > > If you want to use this functionality you will need to do the following: > > > > PTable map to PTable > > PTable gbk to PGT > > PGT combine PTable > > > > This will take advantage of any optimization provided by the CombineFn. > > > > [1] - http://wiki.apache.org/hadoop/HadoopMapReduce > > > > > > > > On Thu, Oct 17, 2013 at 4:30 PM, Chandan Biswas > >wrote: > > > > > Hello Micah, > > > Yes we are using MapFn now. That aggregation and computation is being > > done > > > in reduce phase. As CombineFn after GBK runs into map side, then those > > most > > > computations can be done in map side which are now running in reduce > > phase. > > > Some smaller aggregations and computations can be done on reduce phase. > > > My point was to do some aggregation (and create a new object) in map > > phase > > > instead of in reduce phase. > > > > > > Thanks, > > > Chandan > > > > > > > > > On Thu, Oct 17, 2013 at 3:48 PM, Micah Whitacre > > wrote: > > > > > > > Chandan, > > > > I think what you are wanting will just be a simple MapFn instead > of > > a > > > > CombineFn. The doc of the CombineFn[1] sounds like what you want > with > > > the > > > > statement "A special > > > > DoFn< > > http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/DoFn.html> > > > > implementation > > > > that converts an > > > > Iterable< > > > > > > > > > > http://download.oracle.com/javase/6/docs/api/java/lang/Iterable.html?is-external=true > > > > > > > > > of > > > > values into a single value" but it is expecting the value to be of > the > > > same > > > > time. Since you are wanting to combine the values into a different > > form > > > it > > > > should be fairly trivial to write a MapFn that converts the > Iterable > > > -> > > > > U. > > > > > > > > [1] - > > > > > > http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/CombineFn.html > > > > > > > > > > > > On Thu, Oct 17, 2013 at 3:30 PM, Chandan Biswas < > cbiswas1983@gmail.com > > > > >wrote: > > > > > > > > > I was trying to refactoring some stuffs and trying to use > combineFn. > > > > > But when I went into deeper, found that I can't do it as Crunch > > doesn't > > > > > allow it the functionality I needed. For example, I have a > > > > > PGroupedTable. I wanted to apply CombineFn on it and > wanted > > > to > > > > > get PCollection instead of T. Right now, CombineFn allows only > > > same > > > > > type as return value. The use case of this need is that there will > be > > > > some > > > > > time saving in sorting. It's natural that when aggregating some > > objects > > > > at > > > > > map side can create a new different type object. > > > > > > > > > > Any thought on it? Am I missing any thing? If this can be written > in > > > > > different way using existing way please let me know. > > > > > > > > > > Thanks > > > > > Chandan > > > > > > > > > > > > > > > > > > -- > Director of Data Science > Cloudera > Twitter: @josh_wills > --089e015387f09b281704e8f85c75--