hadoop-pig-dev mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Re: Issues with group as an alias
Date Mon, 16 Jun 2008 18:31:32 GMT
I would like to propose a slight modification:

I think that we should continue to support 'group' as the alias name for 
some transition period (3 or maybe 6 months).  We can remove all 
references to group as an alias from the documentation and print a 
warning when users use it.  But I don't think we should drop it 
immediately, as we'll break many scripts.

Other than that I'm fine with the proposal.
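
To make the transition idea concrete, here is a rough Python sketch of how a
name lookup could keep honoring 'group' while printing a warning. Everything
in it is hypothetical: `resolve_field`, the schema-as-list representation and
the warning text are illustrative assumptions, not Pig's actual code.

```python
import warnings
from typing import List, Optional


def resolve_field(ref: str, schema: List[Optional[str]]) -> int:
    """Return the position in a (CO)GROUP output schema that `ref` names.

    `schema` is a hypothetical representation: position 0 is the group key
    (None if it has no name), followed by the names of the grouped bags.
    """
    # Positional references such as $0 always work.
    if ref.startswith("$"):
        return int(ref[1:])
    # A name that is actually present in the schema wins.
    if ref in schema:
        return schema.index(ref)
    # Transition period: still accept the legacy 'group' alias, but warn.
    if ref == "group":
        warnings.warn(
            "'group' as an alias for the group key is deprecated; "
            "use $0 or assign a name with AS",
            DeprecationWarning,
            stacklevel=2,
        )
        return 0
    raise KeyError(ref)
```

With a schema of ["url", "A", "B"], both "url" and the legacy "group" resolve
to position 0, but only the latter produces the warning. Dropping support
after the transition period would amount to deleting the 'group' branch.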


Chris Olston wrote:
> No.
> The standing proposal for Option III is:
> 1. If you are (CO)Grouping on a *single* field AND in the case of 
> co-group all field names are the same (e.g., cogroup A by url, B by 
> url), then give the group key that name (e.g., "url").
> 2. Else, do *not* automatically assign any name. The user can refer to 
> it as $0 and/or use "AS" to give it a name manually.
> (To be clear, even in case #1, the user has the option to override the 
> automatically-assigned name using "AS" if s/he chooses.)
> -Chris
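
The two rules above can be sketched in a few lines of Python. This is only an
illustration of the rule as stated: `group_key_alias` and its input format are
made up for the example, and treating positional keys such as $0 as unnamed is
an assumption, since the proposal does not spell that case out.

```python
from typing import List, Optional


def group_key_alias(keys_per_input: List[List[str]]) -> Optional[str]:
    """Name automatically given to the group key, or None for "no name".

    keys_per_input[i] holds the grouping field names used for input i of a
    (CO)GROUP, e.g. [["url"], ["url"]] for `cogroup A by url, B by url`.
    """
    # Rule 1 requires exactly one grouping field on every input.
    if not keys_per_input or any(len(keys) != 1 for keys in keys_per_input):
        return None
    # Rule 1 also requires that every input uses the same field name.
    names = {keys[0] for keys in keys_per_input}
    if len(names) != 1:
        return None
    name = next(iter(names))
    # Assumption: a positional key like $0 has no name to carry forward.
    if name.startswith("$"):
        return None
    return name
```

Under this sketch `cogroup A by url, B by url` yields "url", while
`cogroup A by x, B by t` and any multi-field key yield no name, so the key is
reachable only as $0 unless the user names it with AS.
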
> On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:
>> I completely agree. It does start getting confusing, especially if we
>> try to deal with multi-field keys.
>> A = load 'somefile1' USING PigStorage() AS (B, C, Z);
>> B = load 'somefile2' USING PigStorage() AS (A, C, Y);
>> C = load 'somefile3' USING PigStorage() AS (A, B);
>> G1 = COGROUP A by (B,C), B by (A, C);
>> G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);
>> What is the schema for G2?
>> ben
>> On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
>>> So what is the conclusion here?
>>> group key alias == the first variable's group-by field?
>>> What happens in a case like this, then:
>>> -- 
>>> A = load 'somefile1' USING PigStorage() AS (B, C);
>>> B = load 'somefile2' USING PigStorage() AS (A, C);
>>> C = load 'somefile3' USING PigStorage() AS (A, B);
>>> G1 = COGROUP A by B, B by A;
>>> G2 = COGROUP A by C, C by A;
>>> ...
>>> -- 
>>> A slightly contrived example for sure, but IMO the grammar has to be as
>>> clearly specified as possible.
>>> A reserved keyword as the group alias implies we don't hit this problem
>>> (group or groupkey or grpkey)... and also that we are backward
>>> compatible.
>>> [I never liked the inferred schema prefix section in the schemas doc
>>> (which is applied selectively) - it makes it extremely tough to generate
>>> Pig scripts]
>>> Regards,
>>> Mridul
>>> Alan Gates wrote:
>>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field
>>>> (or set of fields) that is grouped on is given the alias 'group'.
>>>> This has a couple of issues:
>>>> 1)  It's confusing.  'group' is now a keyword and an alias.
>>>> 2)  We don't currently allow 'group' as an alias in an AS.  It is
>>>> strange to have an alias that can only be assigned by the language and
>>>> never by the user.
>>>> Possible solutions:
>>>> I) Status quo.  We could fix it so that group is allowed to be assigned
>>>> as an alias in AS.
>>>> Pros:  Backward compatibility
>>>> Cons: a) will make the parser more complicated
>>>>      b) see 1) above.
>>>> II) Don't give an implicit alias to the group key(s).  If users want an
>>>> alias, they can assign it using AS.
>>>> Pros:  Simplicity
>>>> Cons:  We do assign aliases to grouped bags.  That is, if we have C =
>>>> GROUP B by $0 the resulting schema of C is (group, B).  So if we don't
>>>> assign an alias to the group key, we now have a schema ($0, B).  This
>>>> seems strange.  And worse yet, if users want to alias the group key(s),
>>>> they'll be forced to alias all the grouped bags as well.
>>>> III) Carry the alias (if any) that the field had before.  So if we had a
>>>> script like:
>>>> A = load 'myfile' as (x, y, z);
>>>> B = group A by x;
>>>> Then the schema of B would be (x, A).  This is quite natural for grouping
>>>> of single columns.  But it turns nasty when you group on multiple
>>>> columns.  Do we then append the names together?  So if you have
>>>> B = group A by (x, y);
>>>> is the resulting schema (x_y, A)?  Ugh.
>>>> In this case there is also the question of what to do in the case of
>>>> cogroups, where the key may be named differently in different relations.
>>>> A = load 'myfile' as (x, y, z);
>>>> B = load 'myotherfile' as (t, u, v);
>>>> C = cogroup A by x, B by t;
>>>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?  This
>>>> could be resolved by either saying the first one always wins, or allowing
>>>> either.
>>>> Pros:  Very natural for the users, their fields maintain names through
>>>> the query.
>>>> Cons:  Quickly gets burdensome in the case of multi-key groups.
>>>> IV) Assign a non-keyword alias to the group key, like grp or groupkey or
>>>> grpkey (or some other suitable choice).
>>>> Pros:  Least disruptive change.  Users only have to go through their
>>>> scripts and find places where they use the group alias and change it to
>>>> grp (or whatever).
>>>> Cons:  Still leaves us with a situation where we are assigning a name to
>>>> a field arbitrarily, leaving users confused as to how their fields got
>>>> named that.
>>>> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
>>>> relation anyway.
>>>> Pros:  Smaller syntax in a language is always good.
>>>> Cons:  Will break a lot of scripts, and confuse a lot of users who only
>>>> think in terms of GROUP and JOIN and never use COGROUP explicitly.
>>>> One could also conceive of combinations of these.  For example, we
>>>> always assign a name like grpkey to the group key(s), and in the single
>>>> key case we also carry forward the alias that the field already had, if
>>>> any.
>>>> Thoughts?  Other possibilities?
>>>> Alan.
> -- 
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
