hadoop-pig-dev mailing list archives

From Chris Olston <ols...@yahoo-inc.com>
Subject Re: Issues with group as an alias
Date Mon, 16 Jun 2008 19:39:48 GMT
Good idea.

On Jun 16, 2008, at 11:31 AM, Alan Gates wrote:

> I would like to propose a slight modification:
>
> I think that we should continue to support 'group' as the alias  
> name for some transition period (3 or maybe 6 months).  We can  
> remove all references to group as an alias from the documentation  
> and print a warning when users use it.  But I don't think we should  
> drop it immediately, as we'll break many scripts.
>
> Other than that I'm fine with the proposal.
>
> Alan.
>
> Chris Olston wrote:
>> No.
>>
>> The standing proposal for Option III is:
>>
>> 1. If you are (CO)Grouping on a *single* field AND in the case of  
>> co-group all field names are the same (e.g., cogroup A by url, B  
>> by url), then give the group key that name (e.g., "url").
>> 2. Else, do *not* automatically assign any name. The user can  
>> refer to it as $0 and/or use "AS" to give it a name manually.
>>
>> (To be clear, even in case #1, the user has the option to override  
>> the automatically-assigned name using "AS" if s/he chooses.)
>>
>> -Chris
>>
>>
>> On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:
>>
>>> I completely agree. It does start getting confusing. Especially  
>>> if we try to
>>> deal with multi field keys.
>>>
>>> A = load 'somefile1' USING PigStorage() AS (B, C, Z);
>>> B = load 'somefile2' USING PigStorage() AS (A, C, Y);
>>> C = load 'somefile3' USING PigStorage() AS (A, B);
>>>
>>> G1 = COGROUP A by (B,C), B by (A, C);
>>> G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);
>>>
>>> What is the schema for G2?
>>>
>>> ben
>>>
>>> On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
>>>> So what is the conclusion here ?
>>>>
>>>> group key alias == the first variable's group-by field ?
>>>>
>>>>
>>>> What happens in a case like this then :
>>>>
>>>> -- 
>>>> A = load 'somefile1' USING PigStorage() AS (B, C);
>>>> B = load 'somefile2' USING PigStorage() AS (A, C);
>>>> C = load 'somefile3' USING PigStorage() AS (A, B);
>>>>
>>>> G1 = COGROUP A by B, B by A;
>>>> G2 = COGROUP A by C, C by A;
>>>> ...
>>>> -- 
>>>>
>>>> A slightly contrived example for sure, but imo grammar has to be as
>>>> clearly specified as possible.
>>>>
>>>> A reserved keyword as group alias implies we don't hit this problem
>>>> (group or groupkey or grpkey)... and also the fact that we are
>>>> backward compatible.
>>>>
>>>> [I never liked inferred schema prefix section in the schemas doc  
>>>> (which
>>>> is applied selectively) - makes it extremely tough to generate  
>>>> pig scripts]
>>>>
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>> Alan Gates wrote:
>>>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used,  
>>>>> the field
>>>>> (or set of fields) that are grouped on are given the alias  
>>>>> 'group'.
>>>>> This has a couple of issues:
>>>>>
>>>>> 1)  It's confusing.  'group' is now a keyword and an alias.
>>>>> 2)  We don't currently allow 'group' as an alias in an AS.  It is
>>>>> strange to have an alias that can only be assigned by the  
>>>>> language and
>>>>> never by the user.
>>>>>
>>>>> Possible solutions:
>>>>>
>>>>> I) Status quo.  We could fix it so that group is allowed to be  
>>>>> assigned
>>>>> as an alias in AS.
>>>>>
>>>>> Pros:  Backward compatibility
>>>>> Cons: a) will make the parser more complicated
>>>>>      b) see 1) above.
>>>>>
>>>>>
>>>>> II) Don't give an implicit alias to the group key(s).  If users  
>>>>> want an
>>>>> alias, they can assign it using AS.
>>>>>
>>>>> Pros:  Simplicity
>>>>> Cons:  We do assign aliases to grouped bags.  That is, if we  
>>>>> have C =
>>>>> GROUP B by $0 the resulting schema of C is (group, B).  So if  
>>>>> we don't
>>>>> assign an alias to the group key, we now have a schema ($0,  
>>>>> B).  This
>>>>> seems strange.  And worse yet, if users want to alias the group  
>>>>> key(s),
>>>>> they'll be forced to alias all the grouped bags as well.
>>>>>
>>>>> III) Carry the alias (if any) that the field had before.  So if  
>>>>> we had a
>>>>> script like:
>>>>>
>>>>> A = load 'myfile' as (x, y, z);
>>>>> B = group A by x;
>>>>>
>>>>> Then the schema of B would be (x, A).  This is quite natural for
>>>>> grouping
>>>>> of single columns.  But it turns nasty when you group on multiple
>>>>> columns.  Do we then append the names together?  So if you have
>>>>>
>>>>> B = group A by x, y;
>>>>>
>>>>> is the resulting schema (x_y, A)?  Ugh.
>>>>>
>>>>> In this case there is also the question of what to do in the  
>>>>> case of
>>>>> cogroups, where the key may be named differently in different  
>>>>> relations.
>>>>>
>>>>> A = load 'myfile' as (x, y, z);
>>>>> B = load 'myotherfile' as (t, u, v);
>>>>> C = cogroup A by x, B by t;
>>>>>
>>>>> Is the resulting schema (x, A, B) or (t, A, B) or are both  
>>>>> valid?  This
>>>>> could be resolved by either saying first one always wins, or  
>>>>> allowing
>>>>> either.
>>>>>
>>>>> Pros:  Very natural for the users, their fields maintain names  
>>>>> through
>>>>> the query.
>>>>> Cons:  Quickly gets burdensome in the case of multi-key groups.
>>>>>
>>>>> IV) Assign a non-keyword alias to the group key, like grp or  
>>>>> groupkey or
>>>>> grpkey (or some other suitable choice).
>>>>> Pros:  Least disruptive change.  Users only have to go through  
>>>>> their
>>>>> scripts and find places where they use the group alias and  
>>>>> change it to
>>>>> grp (or whatever).
>>>>> Cons:  Still leaves us with a situation where we are assigning  
>>>>> a name to
>>>>> a field arbitrarily, leaving users confused as to how their
>>>>> fields got
>>>>> named that.
>>>>>
>>>>> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
>>>>> relation anyway.
>>>>>
>>>>> Pros:  Smaller syntax in a language is always good.
>>>>> Cons:  Will break a lot of scripts, and confuse a lot of users  
>>>>> who only
>>>>> think in terms of GROUP and JOIN and never use COGROUP explicitly.
>>>>>
>>>>> One could also conceive of combinations of these.  For example, we
>>>>> always assign a name like grpkey to the group key(s), and in  
>>>>> the single
>>>>> key case we also carry forward the alias that the field already  
>>>>> had, if
>>>>> any.
>>>>>
>>>>> Thoughts?  Other possibilities?
>>>>>
>>>>> Alan.
>>>
>>>
>>
>> -- 
>> Christopher Olston, Ph.D.
>> Sr. Research Scientist
>> Yahoo! Research
>>
>>
>>

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
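
The two-step naming rule in the quoted Option III proposal can be sketched as a small function. This is only an illustration of the rule as worded in the thread, not Pig's actual implementation: the function name `group_key_name`, its parameters, and the list-of-lists input shape are all hypothetical. Treating a positional reference such as `$0` as "no name" is a further assumption, based on the "(if any)" wording in the original Option III description.

```python
from typing import Optional, Sequence


def group_key_name(
    keys_per_relation: Sequence[Sequence[str]],
    as_name: Optional[str] = None,
) -> Optional[str]:
    """Return the alias for the group key, or None if it gets no name.

    keys_per_relation holds one list of key fields per (co)grouped
    relation, e.g. [["url"], ["url"]] for `cogroup A by url, B by url`.
    (Hypothetical helper; the input shape is an assumption.)
    """
    # An explicit AS always overrides the automatic name.
    if as_name is not None:
        return as_name

    # Rule 2 fallback: nothing to group on, or any multi-field key.
    if not keys_per_relation:
        return None
    if any(len(keys) != 1 for keys in keys_per_relation):
        return None

    # Rule 1: single field per relation, and all names identical.
    names = {keys[0] for keys in keys_per_relation}
    if len(names) != 1:
        return None

    name = next(iter(names))
    # Assumption: a positional reference ($0, $1, ...) carries no alias.
    if name.startswith("$"):
        return None
    return name
```

Under this sketch, `group A by x` yields the name `x`, while `cogroup A by x, B by t` and the multi-field `COGROUP A by (B,C), B by (A, C)` both yield no automatic name, so the key would be referred to as $0 unless the user supplies AS.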


