Date: Fri, 12 Mar 2010 14:00:48 -0800
Message-ID: <2cbdb1fd1003121400m44e1044cm1c1327180dd2cc56@mail.gmail.com>
Subject: Operating on Cogroups and Iterations in Pig Re: more bagging fun
From: hc busy
To: "pig-user@hadoop.apache.org", pig-dev@hadoop.apache.org

Hmm, okay, I read the documentation further and it appears that this has already been discussed previously (here). There seems to be a question of what the right thing to do is. It seems clear to me, though. When an operator like '*' is applied, it is clearly an item-wise operation to be applied to each member of the bag. If a function is an aggregate (SUM), then it operates across an entire bag.

When a COGROUP occurs, just do what SQL does, which is to say, perform a cross join if an aggregate has been applied across several bags. And do so automatically, so we don't have to type out the separate FLATTENs:

    grouped = COGROUP employee BY name, bonuses BY name;
    flattened = FOREACH grouped GENERATE group, FLATTEN(employee), FLATTEN(bonuses);
    grouped_again = GROUP flattened BY group;
    total_compensation = FOREACH grouped_again GENERATE group, SUM(employee::salary * bonuses::multiplier);

So this should do the same:

    grouped = COGROUP employee BY name, bonuses BY name;
    total_compensation = FOREACH grouped GENERATE group, SUM(employee::salary * bonuses::multiplier);

automatically, because that can only have one meaning.
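
To spell out the "one meaning" I have in mind, here is a rough Python model of the semantics (not Pig code). The rows are made up for illustration, and cogroup() and total_compensation() are hypothetical helper names; the only Pig behaviour assumed is that FLATTENing two bags in one GENERATE yields their cross product, and that an empty bag yields no rows.

```python
from collections import defaultdict
from itertools import product

# Made-up sample relations: (name, salary) and (name, multiplier).
employee = [("alice", 100), ("bob", 50)]
bonuses = [("alice", 2), ("alice", 3), ("bob", 4)]


def cogroup(left, right):
    """Model COGROUP left BY name, right BY name: key -> (left bag, right bag)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[0]][0].append(row)
    for row in right:
        groups[row[0]][1].append(row)
    return dict(groups)


def total_compensation(left, right):
    """Per key, sum salary * multiplier over the cross product of both bags."""
    totals = {}
    for name, (emp_bag, bonus_bag) in cogroup(left, right).items():
        # FLATTEN(employee), FLATTEN(bonuses) corresponds to this cross product.
        terms = [e[1] * b[1] for e, b in product(emp_bag, bonus_bag)]
        if terms:  # an empty bag flattens to no rows, so the key drops out
            totals[name] = sum(terms)
    return totals


result = total_compensation(employee, bonuses)
print(result)
```

Both the four-statement version and the proposed two-statement version would be expected to produce this same per-key total.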
Alternatively, if it is desired to stay with a low-level language, all of this confusion around UDFs that take bags and UDFs that operate on members of bags can be resolved if we do two things:

1.) Allow UDFs to actually become first-class citizens. This way we can pass UDFs to other UDFs.
2.) Introduce the concept of map() and reduce() operators over bags.

These two things allow us more freedom and follow the map-reduce paradigm more closely.

    grouped = COGROUP employee BY name, bonuses BY name;
    total_compensation = FOREACH grouped GENERATE group, reduce(SUM, map(*, employee::salary, bonuses::multiplier));

Actually, this may deserve a separate keyword. Because map and reduce operate on single bags, whereas Pig introduces this concept of co-grouping, we should have *comap* and *coreduce* that take functions and operate on multiple bags that are the results of a *cogroup*.

    grouped = COGROUP employee BY name, bonuses BY name;
    total_compensation = FOREACH grouped GENERATE group, REDUCE(SUM, COMAP(*, employee::salary, bonuses::multiplier));

This allows us to write efficiently, on one line, what would otherwise be several aliases and unnecessary FLATTENed cross products.

A second thing that I see is the recommendation of implementing looping constructs. I wonder if I may suggest, as a follow-up to the above, that we beef up UDFs as first-class citizens and add the ability to create UDFs in Pig Latin with the ability to recurse. The reason I think this is a better way to loop than *for(;;)*, *while(){}* and *do{}while()* statements is that recursive calls are functional and are more easily optimizable than imperative code. The PigJournal has an entry for all of these constructs and functions under the heading "Extending Pig to Include Branching, Looping, and Functions", but because the map-reduce paradigm is inherently functional, I think that staying functional would be a better way to approach this improvement.
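
A rough Python sketch of both ideas together (a model only, not Pig code). It assumes COMAP applies the function to every combination of items from the co-grouped bags, matching the FLATTENed cross product above; comap() and reduce_bag() are hypothetical names, and the numbers are made up. The reduce is written with recursion and a branch rather than a loop statement, which is the style of iteration I am arguing for.

```python
from itertools import product
from operator import add, mul


def comap(fn, *bags):
    """Apply fn item-wise to every combination drawn from the co-grouped bags."""
    return [fn(*combo) for combo in product(*bags)]


def reduce_bag(fn, bag, initial):
    """Fold a bag with fn using recursion and a branch instead of a loop."""
    if not bag:  # base case: nothing left to fold
        return initial
    # Recursive case: fold the first item into the accumulator, recurse on the rest.
    return reduce_bag(fn, bag[1:], fn(initial, bag[0]))


# One cogroup row for a single key: a bag of salaries and a bag of multipliers.
salaries = [100, 200]
multipliers = [2, 3]

# Counterpart of REDUCE(SUM, COMAP(*, employee::salary, bonuses::multiplier)).
total = reduce_bag(add, comap(mul, salaries, multipliers), 0)
print(total)
```

Because add and mul are passed in as ordinary values, the same two helpers cover any item-wise operator and any aggregate, which is the point of making UDFs first-class.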
So the minimal set of additional features needed is functions and branching, and we would get loops as a side effect of those improvements. In order for the optimizations to be available to the Pig Latin interpreter, the functions and branching *must* be implemented within the Pig system. If they are externalized, or implemented as UDL of some other language, then the opportunities for optimizing the execution vanish.

Anyways, a couple of cents on a rainy day.

On Wed, Mar 10, 2010 at 10:15 AM, hc busy wrote:

> An additional thought... we can define udf's like
>
> ADD(bag{(int,int)}), DIVIDE(bag{(int,int)}), MULTIPLY(bag{(int,int)}),
> SQRT(bag{(float)})..
>
> basically vectorize most of the common arithmetic operations, but then the
> language has to support it by converting
>
> bag.a + bag.b
>
> to
>
> ADD(bag.(a,b))
>
> I guess there are some difficulties, for instance:
>
> SQRT(bag.a)+bag.b
>
> How would this work? because sqrt(bag.a) returns a bag, how would we
> convert it to the correct per tuple operation? It's almost like we want to
> convert an expression
>
> SUM(SQRT(bag.a),bag.b)
>
> into a function F such that
>
> SUM(SQRT(bag.a),bag.b) = F(bag.a,bag.b)
>
> and then the F is computed by iterating through on each tuple of the bag.
>
> FOREACH ... GENERATE ..., F(bag.(a,b));
>
> On Wed, Mar 10, 2010 at 9:31 AM, hc busy wrote:
>
>> So, pig team, what is the right way to accomplish this?
>>
>> On Tue, Mar 9, 2010 at 10:50 PM, Mridul Muralidharan <
>> mridulm@yahoo-inc.com> wrote:
>>
>>> On Tuesday 09 March 2010 04:13 AM, hc busy wrote:
>>>
>>>> okay.
>>>> Here's the bag that I have:
>>>>
>>>> {group: (a: int,b: chararray,c: chararray,d: int), TABLE: {number1: int, number2:int}}
>>>>
>>>> and I want to do this
>>>>
>>>> grunt> CALCULATE= FOREACH TABLE_group GENERATE group, SUM(TABLE.number1 / TABLE.number2);
>>>>
>>>
>>> TABLE.number1 actually gives you the bag of number1 values found in TABLE
>>> - but I am never really sure of the semantics in these situations since I am
>>> slightly nervous that it is impl dependent ... my understanding is, what you
>>> are attempting should not work, but I could be wrong.
>>>
>>> I do know that TABLE.(number1, number2) will consistently project and
>>> pair up the fields : so to 'fix' this, you can write your own DIVIDE_SUM
>>> which does something like this :
>>>
>>> grunt> CALCULATE= FOREACH TABLE_group GENERATE group,
>>> DIVIDE_SUM(TABLE.(number1 , number2));
>>>
>>> And DIVIDE_SUM udf impl takes in a bag with tuples containing schema
>>> (numerator, denominator) : and returns :
>>>
>>> result == sum ( foreach tuple ( tuple.numerator / tuple.denominator ) );
>>>
>>> Obviously, this is not as 'elegant' as your initial code and is
>>> definitely more cumbersome ... so clarifying this behavior with someone from
>>> pig team will definitely be better before you attempt this.
>>>
>>> Regards,
>>> Mridul
>>>
>>>> grunt> DUMP CALCULATE;
>>>>
>>>> 2010-03-08 14:02:41,055 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>>> ERROR 1039: Incompatible types in Multiplication Operator left hand
>>>> side:bag right hand side:bag
>>>>
>>>> This seems useful that I may want to calculate an agg. of some arithmetic
>>>> operations on member of a bag. Any suggestions?
>>>>
>>>> ... Looking at the documentation it looks like I want to do something like
>>>>
>>>> SUM(TABLE.(number1 / number2))
>>>>
>>>> but that doesn't work either :-(
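
For reference, the DIVIDE_SUM behaviour described in the quoted reply can be modelled in a few lines of Python. This is only a sketch of the per-tuple logic, not a Pig UDF; the sample bag is made up, and Python's / is true division, so the result may differ from what Pig would return for int fields.

```python
def divide_sum(bag):
    """Sum numerator / denominator over a bag of (numerator, denominator) tuples."""
    return sum(numerator / denominator for numerator, denominator in bag)


# Made-up bag standing in for TABLE.(number1, number2) within one group.
table = [(10, 2), (9, 3), (8, 4)]
print(divide_sum(table))
```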