Return-Path: X-Original-To: apmail-crunch-dev-archive@www.apache.org Delivered-To: apmail-crunch-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 47E7C10968 for ; Fri, 18 Oct 2013 01:19:21 +0000 (UTC) Received: (qmail 82259 invoked by uid 500); 18 Oct 2013 01:19:21 -0000 Delivered-To: apmail-crunch-dev-archive@crunch.apache.org Received: (qmail 82239 invoked by uid 500); 18 Oct 2013 01:19:21 -0000 Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@crunch.apache.org Delivered-To: mailing list dev@crunch.apache.org Received: (qmail 82231 invoked by uid 99); 18 Oct 2013 01:19:21 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 18 Oct 2013 01:19:21 +0000 X-ASF-Spam-Status: No, hits=1.7 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of cbiswas1983@gmail.com designates 209.85.223.173 as permitted sender) Received: from [209.85.223.173] (HELO mail-ie0-f173.google.com) (209.85.223.173) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 18 Oct 2013 01:19:14 +0000 Received: by mail-ie0-f173.google.com with SMTP id u16so5492873iet.32 for ; Thu, 17 Oct 2013 18:18:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=HTyYTyNQX/BVZILJ5ncmR0j4aobfNG/OvifeADNIltY=; b=fZHvNn3XFzyRPNagfsDb2hl2UeJEiOPkvZHeDiz/Mr2VeY6BZmtF3BVkdvKH1SdpNO YhjuTTVLamGcpO5U2JeSRU555aNokXDPQxNhtnIxPGI/d5U97sarbIh+lcI8mZyMXOcL sscWeVqxVYlFq84++mFnTqgubZ/0mTuPX95oa/2AZLX/dYlsKDwOLWlKBZ56/3Jv+CY+ WvSAFe59rN6mQiT/bKp73UumXXgBygF0778fcocCXDE4y1VlpaU/DB+PlrajmHrmuK6q USPXbdwAqpxMM66tCFDAJ7xvwe0Z1Mgs9DFag2XV0XAsOEFpeTJS42HbrEQ5yOjGW8KK XYEw== MIME-Version: 1.0 X-Received: by 10.50.178.234 with SMTP id db10mr682804igc.35.1382059133085; Thu, 17 Oct 2013 18:18:53 -0700 (PDT) Received: by 10.64.228.242 with HTTP; Thu, 17 Oct 2013 18:18:53 -0700 (PDT) In-Reply-To: References: Date: Thu, 17 Oct 2013 20:18:53 -0500 Message-ID: Subject: Re: Process of CombineFn returns ? From: Chandan Biswas To: dev@crunch.apache.org Content-Type: multipart/alternative; boundary=089e015387f0afb94604e8f9b7db X-Virus-Checked: Checked by ClamAV on apache.org --089e015387f0afb94604e8f9b7db Content-Type: text/plain; charset=ISO-8859-1 I have PTable. and getting after reduce PTable T--> Comment{ String comment, String author}, U--> Book{String id, String lengthiestComment, int noOfComments} But wanted to some aggregations in the map side based on some logic instead of all aggregations at reduce side. Yes in worst case, data flow over the n/w will remain same, but sorting will be improved. Thanks, Chandan On Thu, Oct 17, 2013 at 6:46 PM, Josh Wills wrote: > On Thu, Oct 17, 2013 at 4:41 PM, Chandan Biswas >wrote: > > > Yeah, I agree with Micah that it will not eliminate the reduce phase > > entirely. But the dummy object of U suggested by Josh (or converting to U > > type in map for every record) will not improve performance because same > > amounts of records will be sorted and aggregated in the reduce phase. > > > I don't think that's true-- the records of type U will be combined on the > map-side, which would reduce the amount of data that is pushed over the > network and improve performance. > > Can you give any additional details about what T and U are in this > scenario? :) > > > > > But > > my point is, can we improve it by applying a combiner where the combineFn > > provides output as different type. If we have same type, we can use the > > combiner to do some aggregation in map side which improves performance. > > But, can we have some mechanism by which the same advantage can be > achieved > > when combineFn emits different type. I think, emitting same type by > > CombineFn has restricted its use. Can we have new CombineFn that allows > us > > to output different type not only same type as input? > > > > > > On Thu, Oct 17, 2013 at 5:05 PM, Josh Wills wrote: > > > > > Yeah, my experience in these kinds of situations is that you need to > come > > > up with a "dummy" or singleton version of U for the case where there is > > > only a single T and do that conversion on the map side of the job, > before > > > the combiner runs. I think Chao had an issue like this awhile ago, > where > > he > > > had a PTable and wanted to write a combiner that would > > > return a PTable>. The solution was to > convert > > > the map-side object to a PTable>, where the > > > value on the map-side was a singleton list containing just that double > > > value. Does that sort of trick work here? > > > > > > > > > On Thu, Oct 17, 2013 at 2:57 PM, Micah Whitacre > > wrote: > > > > > > > Ok so the feature you are trying to achieve is the proactive > > combination > > > of > > > > data before performing the GBK like the javadoc describes. > Essentially > > > in > > > > that situation the CombineFn is being used as a Combiner[1] to > combine > > > the > > > > data local to that mapper before doing the GBK and then further > > combining > > > > the data in the reduce operation. It will not necessarily eliminate > > the > > > > need for all processing in the reduce. > > > > > > > > If you want to use this functionality you will need to do the > > following: > > > > > > > > PTable map to PTable > > > > PTable gbk to PGT > > > > PGT combine PTable > > > > > > > > This will take advantage of any optimization provided by the > CombineFn. > > > > > > > > [1] - http://wiki.apache.org/hadoop/HadoopMapReduce > > > > > > > > > > > > > > > > On Thu, Oct 17, 2013 at 4:30 PM, Chandan Biswas < > cbiswas1983@gmail.com > > > > >wrote: > > > > > > > > > Hello Micah, > > > > > Yes we are using MapFn now. That aggregation and computation is > being > > > > done > > > > > in reduce phase. As CombineFn after GBK runs into map side, then > > those > > > > most > > > > > computations can be done in map side which are now running in > reduce > > > > phase. > > > > > Some smaller aggregations and computations can be done on reduce > > phase. > > > > > My point was to do some aggregation (and create a new object) in > map > > > > phase > > > > > instead of in reduce phase. > > > > > > > > > > Thanks, > > > > > Chandan > > > > > > > > > > > > > > > On Thu, Oct 17, 2013 at 3:48 PM, Micah Whitacre > > > > wrote: > > > > > > > > > > > Chandan, > > > > > > I think what you are wanting will just be a simple MapFn > instead > > > of > > > > a > > > > > > CombineFn. The doc of the CombineFn[1] sounds like what you want > > > with > > > > > the > > > > > > statement "A special > > > > > > DoFn< > > > > http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/DoFn.html> > > > > > > implementation > > > > > > that converts an > > > > > > Iterable< > > > > > > > > > > > > > > > > > > > > > http://download.oracle.com/javase/6/docs/api/java/lang/Iterable.html?is-external=true > > > > > > > > > > > > > of > > > > > > values into a single value" but it is expecting the value to be > of > > > the > > > > > same > > > > > > time. Since you are wanting to combine the values into a > different > > > > form > > > > > it > > > > > > should be fairly trivial to write a MapFn that converts the > > > Iterable > > > > > -> > > > > > > U. > > > > > > > > > > > > [1] - > > > > > > > > > > > > http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/CombineFn.html > > > > > > > > > > > > > > > > > > On Thu, Oct 17, 2013 at 3:30 PM, Chandan Biswas < > > > cbiswas1983@gmail.com > > > > > > >wrote: > > > > > > > > > > > > > I was trying to refactoring some stuffs and trying to use > > > combineFn. > > > > > > > But when I went into deeper, found that I can't do it as Crunch > > > > doesn't > > > > > > > allow it the functionality I needed. For example, I have a > > > > > > > PGroupedTable. I wanted to apply CombineFn on it and > > > wanted > > > > > to > > > > > > > get PCollection instead of T. Right now, CombineFn allows > > only > > > > > same > > > > > > > type as return value. The use case of this need is that there > > will > > > be > > > > > > some > > > > > > > time saving in sorting. It's natural that when aggregating some > > > > objects > > > > > > at > > > > > > > map side can create a new different type object. > > > > > > > > > > > > > > Any thought on it? Am I missing any thing? If this can be > written > > > in > > > > > > > different way using existing way please let me know. > > > > > > > > > > > > > > Thanks > > > > > > > Chandan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Director of Data Science > > > Cloudera > > > Twitter: @josh_wills > > > > > > > > > -- > Director of Data Science > Cloudera > Twitter: @josh_wills > --089e015387f0afb94604e8f9b7db--