hadoop-pig-dev mailing list archives

From Mridul Muralidharan <mrid...@yahoo-inc.com>
Subject Re: Issues with group as an alias
Date Sat, 14 Jun 2008 13:46:00 GMT

So what is the conclusion here?

Is the group key alias the same as the first relation's group-by field?


What happens in a case like this, then:

--
A = load 'somefile1' USING PigStorage() AS (B, C);
B = load 'somefile2' USING PigStorage() AS (A, C);
C = load 'somefile3' USING PigStorage() AS (A, B);

G1 = COGROUP A by B, B by A;
G2 = COGROUP A by C, C by A;
...
--

A slightly contrived example for sure, but imo the grammar has to be as 
clearly specified as possible.

A reserved keyword as the group alias (group or groupkey or grpkey) 
means we don't hit this problem... and we also stay backward 
compatible.
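
To make the collision concrete, here is a minimal sketch in Python. It is 
not Pig's implementation: the helper names cogroup_schema and 
duplicate_names are hypothetical, and the "first key wins" rule is only 
one of the resolutions floated in the quoted proposal. It computes the 
output schema of the two COGROUP statements above, first with the key 
named after the first relation's group-by field, then with a reserved 
alias (grpkey is used purely as an example).

```python
def cogroup_schema(inputs, key_alias=None):
    """Sketch of the output schema of a COGROUP (illustrative only).

    inputs: list of (relation_alias, key_field) pairs in COGROUP order.
    key_alias: a reserved alias for the group key; if None, the key takes
    the first relation's key field name ("first key wins").
    The key is followed by one bag per relation, named after the relation.
    """
    first_key = inputs[0][1]
    alias = key_alias if key_alias is not None else first_key
    return [alias] + [rel for rel, _ in inputs]


def duplicate_names(schema):
    """Return the names that appear more than once in a schema, sorted."""
    return sorted({name for name in schema if schema.count(name) > 1})


# G1 = COGROUP A by B, B by A;
g1_first = cogroup_schema([("A", "B"), ("B", "A")])
# G2 = COGROUP A by C, C by A;
g2_first = cogroup_schema([("A", "C"), ("C", "A")])

# Same statements, but with a reserved alias for the group key.
g1_reserved = cogroup_schema([("A", "B"), ("B", "A")], key_alias="grpkey")
g2_reserved = cogroup_schema([("A", "C"), ("C", "A")], key_alias="grpkey")

for name, schema in [("G1 first-wins", g1_first),
                     ("G2 first-wins", g2_first),
                     ("G1 reserved", g1_reserved),
                     ("G2 reserved", g2_reserved)]:
    print(name, schema, "duplicates:", duplicate_names(schema))
```

Under "first key wins", G1 comes out as (B, A, B) and G2 as (C, A, C), so 
the key name clashes with a bag name in both. With a reserved alias, no 
user-chosen name can clash with the key.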

[I never liked the inferred schema prefix section in the schemas doc 
(which is applied selectively); it makes it extremely tough to generate 
pig scripts.]


Regards,
Mridul



Alan Gates wrote:
> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field 
> (or set of fields) that are grouped on are given the alias 'group'.  
> This has a couple of issues:
> 
> 1)  It's confusing.  'group' is now a keyword and an alias.
> 2)  We don't currently allow 'group' as an alias in an AS.  It is 
> strange to have an alias that can only be assigned by the language and 
> never by the user.
> 
> Possible solutions:
> 
> I) Status quo.  We could fix it so that group is allowed to be assigned 
> as an alias in AS.
> 
> Pros:  Backward compatibility
> Cons: a) will make the parser more complicated
>      b) see 1) above.
> 
> 
> II) Don't give an implicit alias to the group key(s).  If users want an 
> alias, they can assign it using AS.
> 
> Pros:  Simplicity
> Cons:  We do assign aliases to grouped bags.  That is, if we have C = 
> GROUP B by $0 the resulting schema of C is (group, B).  So if we don't 
> assign an alias to the group key, we now have a schema ($0, B).  This 
> seems strange.  And worse yet, if users want to alias the group key(s), 
> they'll be forced to alias all the grouped bags as well.
> 
> III) Carry the alias (if any) that the field had before.  So if we had a 
> script like:
> 
> A = load 'myfile' as (x, y, z);
> B = group A by x;
> 
> Then the schema of B would be (x, A).  This is quite natural for grouping 
> of single columns.  But it turns nasty when you group on multiple 
> columns.  Do we then append the names together?  So if you have
> 
> B = group A by x, y;
> 
> is the resulting schema (x_y, A)?  Ugh.
> 
> In this case there is also the question of what to do in the case of 
> cogroups, where the key may be named differently in different relations.
> 
> A = load 'myfile' as (x, y, z);
> B = load 'myotherfile' as (t, u, v);
> C = cogroup A by x, B by t;
> 
> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?  This 
> could be resolved by either saying first one always wins, or allowing 
> either.
> 
> Pros:  Very natural for the users, their fields maintain names through 
> the query.
> Cons:  Quickly gets burdensome in the case of multi-key groups.
> 
> IV) Assign a non-keyword alias to the group key, like grp or groupkey or 
> grpkey (or some other suitable choice).
> Pros:  Least disruptive change.  Users only have to go through their 
> scripts and find places where they use the group alias and change it to 
> grp (or whatever).
> Cons:  Still leaves us with a situation where we are assigning a name to 
> a field arbitrarily, leaving users confused as to how their fields got 
> named that.
> 
> V) Remove GROUP as a keyword.  It is just short for COGROUP of one 
> relation anyway.
> 
> Pros:  Smaller syntax in a language is always good.
> Cons:  Will break a lot of scripts, and confuse a lot of users who only 
> think in terms of GROUP and JOIN and never use COGROUP explicitly.
> 
> One could also conceive of combinations of these.  For example, we 
> always assign a name like grpkey to the group key(s), and in the single 
> key case we also carry forward the alias that the field already had, if 
> any.
> 
> Thoughts?  Other possibilities?
> 
> Alan.

