Return-Path: X-Original-To: apmail-crunch-dev-archive@www.apache.org Delivered-To: apmail-crunch-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2D543109FB for ; Fri, 18 Oct 2013 01:48:41 +0000 (UTC) Received: (qmail 1494 invoked by uid 500); 18 Oct 2013 01:48:41 -0000 Delivered-To: apmail-crunch-dev-archive@crunch.apache.org Received: (qmail 1474 invoked by uid 500); 18 Oct 2013 01:48:41 -0000 Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@crunch.apache.org Delivered-To: mailing list dev@crunch.apache.org Received: (qmail 1466 invoked by uid 99); 18 Oct 2013 01:48:41 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 18 Oct 2013 01:48:41 +0000 X-ASF-Spam-Status: No, hits=2.5 required=5.0 tests=FREEMAIL_REPLY,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of mkwhit@gmail.com designates 209.85.223.178 as permitted sender) Received: from [209.85.223.178] (HELO mail-ie0-f178.google.com) (209.85.223.178) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 18 Oct 2013 01:48:34 +0000 Received: by mail-ie0-f178.google.com with SMTP id to1so7031118ieb.37 for ; Thu, 17 Oct 2013 18:48:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=OGlcjMhoGlsrBx9MXK8+rhYYEuf4bKQ0PJjoB+a2UR4=; b=XvwktMVvBYGbaltGbJphde8moUl7XYVPIYJcBFlLmCtmRVRmAM1NaRn5AGcBcDpkpw pKLjXZYaLylXSxKw4HMQ+0b8USMlzN5D/6Fp+TUwsLFMA8UUHraAhWu0hTnSvfYViJ7W 6qDxtFypnMfZLogOD5XHHiDA8IlrozU28sdaTy69/2DPoaapWrS/uuV3rlwFwKJfx5yH XtHmmPKuQNUtySOUpg4IQ7eSDgaa1IYBcb/+CrMxCrLOUwybObiSa6LPR3QrgnLThwI/ U2LEEvtMFpk9PAjdJ9F7PzRTcYnveiajF6mLyplHYBVTE9/l7vdcV9OCbmd2DyaufcLg hhlA== X-Received: by 10.50.120.104 with SMTP id lb8mr867918igb.22.1382060892301; Thu, 17 Oct 2013 18:48:12 -0700 (PDT) MIME-Version: 1.0 Received: by 10.64.87.198 with HTTP; Thu, 17 Oct 2013 18:47:52 -0700 (PDT) In-Reply-To: References: From: Micah Whitacre Date: Thu, 17 Oct 2013 20:47:52 -0500 Message-ID: Subject: Re: Process of CombineFn returns ? To: dev@crunch.apache.org Content-Type: multipart/alternative; boundary=047d7bd76bb28b303104e8fa20f4 X-Virus-Checked: Checked by ClamAV on apache.org --047d7bd76bb28b303104e8fa20f4 Content-Type: text/plain; charset=ISO-8859-1 Chandan, So let's apply your situation to the types and conversion that is proposed and break it down where logic will be applied. Say we have a PCollection that is like the following: Mapper 1: Mapper 2 This will be represented by the PTable. We then apply a MapFn to transform it into PTable and we'd get the following in our PCollection: Mapper 1 > > > Mapper 2 > Then if we were to use the GBK + CombineFn, the output of the map phase would be.. Mapper 1 > > Mapper 2 > Notice Mapper 1 would only be emitting 2 items instead of 3 and therefore less data is sent over the wire and has to be sorted. Also in the reducer after the GBK is completed the CombineFn would finish its work and you'd get the following: Reducer 1 > > The only case where this would not improve performance is if you never emit data for the same key from the same mapper or your mapper doesn't reduce the size of the data. On Thu, Oct 17, 2013 at 8:18 PM, Chandan Biswas wrote: > I have PTable. and getting after reduce PTable Book> > > T--> Comment{ String comment, String author}, U--> Book{String id, String > lengthiestComment, int noOfComments} > > But wanted to some aggregations in the map side based on some logic instead > of all aggregations at reduce side. > Yes in worst case, data flow over the n/w will remain same, but sorting > will be improved. > > Thanks, > Chandan > > > On Thu, Oct 17, 2013 at 6:46 PM, Josh Wills wrote: > > > On Thu, Oct 17, 2013 at 4:41 PM, Chandan Biswas > >wrote: > > > > > Yeah, I agree with Micah that it will not eliminate the reduce phase > > > entirely. But the dummy object of U suggested by Josh (or converting > to U > > > type in map for every record) will not improve performance because > same > > > amounts of records will be sorted and aggregated in the reduce phase. > > > > > > I don't think that's true-- the records of type U will be combined on the > > map-side, which would reduce the amount of data that is pushed over the > > network and improve performance. > > > > Can you give any additional details about what T and U are in this > > scenario? :) > > > > > > > > > But > > > my point is, can we improve it by applying a combiner where the > combineFn > > > provides output as different type. If we have same type, we can use the > > > combiner to do some aggregation in map side which improves performance. > > > But, can we have some mechanism by which the same advantage can be > > achieved > > > when combineFn emits different type. I think, emitting same type by > > > CombineFn has restricted its use. Can we have new CombineFn that allows > > us > > > to output different type not only same type as input? > > > > > > > > > On Thu, Oct 17, 2013 at 5:05 PM, Josh Wills > wrote: > > > > > > > Yeah, my experience in these kinds of situations is that you need to > > come > > > > up with a "dummy" or singleton version of U for the case where there > is > > > > only a single T and do that conversion on the map side of the job, > > before > > > > the combiner runs. I think Chao had an issue like this awhile ago, > > where > > > he > > > > had a PTable and wanted to write a combiner that > would > > > > return a PTable>. The solution was to > > convert > > > > the map-side object to a PTable>, where > the > > > > value on the map-side was a singleton list containing just that > double > > > > value. Does that sort of trick work here? > > > > > > > > > > > > On Thu, Oct 17, 2013 at 2:57 PM, Micah Whitacre > > > wrote: > > > > > > > > > Ok so the feature you are trying to achieve is the proactive > > > combination > > > > of > > > > > data before performing the GBK like the javadoc describes. > > Essentially > > > > in > > > > > that situation the CombineFn is being used as a Combiner[1] to > > combine > > > > the > > > > > data local to that mapper before doing the GBK and then further > > > combining > > > > > the data in the reduce operation. It will not necessarily > eliminate > > > the > > > > > need for all processing in the reduce. > > > > > > > > > > If you want to use this functionality you will need to do the > > > following: > > > > > > > > > > PTable map to PTable > > > > > PTable gbk to PGT > > > > > PGT combine PTable > > > > > > > > > > This will take advantage of any optimization provided by the > > CombineFn. > > > > > > > > > > [1] - http://wiki.apache.org/hadoop/HadoopMapReduce > > > > > > > > > > > > > > > > > > > > On Thu, Oct 17, 2013 at 4:30 PM, Chandan Biswas < > > cbiswas1983@gmail.com > > > > > >wrote: > > > > > > > > > > > Hello Micah, > > > > > > Yes we are using MapFn now. That aggregation and computation is > > being > > > > > done > > > > > > in reduce phase. As CombineFn after GBK runs into map side, then > > > those > > > > > most > > > > > > computations can be done in map side which are now running in > > reduce > > > > > phase. > > > > > > Some smaller aggregations and computations can be done on reduce > > > phase. > > > > > > My point was to do some aggregation (and create a new object) in > > map > > > > > phase > > > > > > instead of in reduce phase. > > > > > > > > > > > > Thanks, > > > > > > Chandan > > > > > > > > > > > > > > > > > > On Thu, Oct 17, 2013 at 3:48 PM, Micah Whitacre < > mkwhit@gmail.com> > > > > > wrote: > > > > > > > > > > > > > Chandan, > > > > > > > I think what you are wanting will just be a simple MapFn > > instead > > > > of > > > > > a > > > > > > > CombineFn. The doc of the CombineFn[1] sounds like what you > want > > > > with > > > > > > the > > > > > > > statement "A special > > > > > > > DoFn< > > > > > http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/DoFn.html > > > > > > > > > implementation > > > > > > > that converts an > > > > > > > Iterable< > > > > > > > > > > > > > > > > > > > > > > > > > > > > http://download.oracle.com/javase/6/docs/api/java/lang/Iterable.html?is-external=true > > > > > > > > > > > > > > > of > > > > > > > values into a single value" but it is expecting the value to be > > of > > > > the > > > > > > same > > > > > > > time. Since you are wanting to combine the values into a > > different > > > > > form > > > > > > it > > > > > > > should be fairly trivial to write a MapFn that converts the > > > > Iterable > > > > > > -> > > > > > > > U. > > > > > > > > > > > > > > [1] - > > > > > > > > > > > > > > > > http://crunch.apache.org/apidocs/0.7.0/org/apache/crunch/CombineFn.html > > > > > > > > > > > > > > > > > > > > > On Thu, Oct 17, 2013 at 3:30 PM, Chandan Biswas < > > > > cbiswas1983@gmail.com > > > > > > > >wrote: > > > > > > > > > > > > > > > I was trying to refactoring some stuffs and trying to use > > > > combineFn. > > > > > > > > But when I went into deeper, found that I can't do it as > Crunch > > > > > doesn't > > > > > > > > allow it the functionality I needed. For example, I have a > > > > > > > > PGroupedTable. I wanted to apply CombineFn on it > and > > > > wanted > > > > > > to > > > > > > > > get PCollection instead of T. Right now, CombineFn > allows > > > only > > > > > > same > > > > > > > > type as return value. The use case of this need is that there > > > will > > > > be > > > > > > > some > > > > > > > > time saving in sorting. It's natural that when aggregating > some > > > > > objects > > > > > > > at > > > > > > > > map side can create a new different type object. > > > > > > > > > > > > > > > > Any thought on it? Am I missing any thing? If this can be > > written > > > > in > > > > > > > > different way using existing way please let me know. > > > > > > > > > > > > > > > > Thanks > > > > > > > > Chandan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Director of Data Science > > > > Cloudera > > > > Twitter: @josh_wills > > > > > > > > > > > > > > > -- > > Director of Data Science > > Cloudera > > Twitter: @josh_wills > > > --047d7bd76bb28b303104e8fa20f4--