hadoop-pig-dev mailing list archives

From Chris Olston <ols...@yahoo-inc.com>
Subject Re: Issues with group as an alias
Date Mon, 09 Jun 2008 15:13:58 GMT
The issue of non-reserved keywords is orthogonal to the issue at  
hand: what is the most natural way to name the group key (i.e., we  
can still allow non-reserved keywords, and select a different way of  
naming the group key, if we want).

Every time I give a tutorial on Pig, people struggle to understand  
what this mysterious "group" field is. It is ugly and non-intuitive.

Option III is far more natural, and will cover 95% of the cases (for  
the rest of the cases, the user is doing something complicated so I  
think it's okay for them to name the group key manually).


On Jun 9, 2008, at 3:39 AM, pi song wrote:

> I prefer (I) and that means I want to allow non-reserved keywords.
> On Fri, Jun 6, 2008 at 9:33 AM, pi song <pi.songs@gmail.com> wrote:
>> I know it is very subjective to say I don't agree with "1)  It's
>> confusing". On developers' side, it is. But on users' side, it  
>> might not.
>> Some languages allow usage of keywords given they are used in the  
>> right
>> context. The current Pig implementation also allows referring to  
>> "group" as
>> an alias.
>> Before we jump to the solution, shouldn't it be better to make our
>> position clear on "Do we want every keyword to be reserved word  
>> regardless
>> of context?"
>> Pi
>> On 6/6/08, Chris Olston <olston@yahoo-inc.com> wrote:
>>> I vote for (III) -- propagate the alias. This makes the scripts very
>>> natural and readable, e.g.:
>>> a = group pages by host;
>>> b = foreach a generate host, count(pages);
>>> As for what to do in the case of grouping on multiple fields, or  
>>> co-group
>>> on differently-named fields, we should *not* assign a default  
>>> name -- the
>>> user can choose a name using "AS".
>>> -Chris
>>> On Jun 5, 2008, at 9:10 AM, Alan Gates wrote:
>>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used,  
>>>> the field
>>>> (or set of fields) that are grouped on are given the alias  
>>>> 'group'.  This
>>>> has a couple of issues:
>>>> 1)  It's confusing.  'group' is now a keyword and an alias.
>>>> 2)  We don't currently allow 'group' as an alias in an AS.  It  
>>>> is strange
>>>> to have an alias that can only be assigned by the language and  
>>>> never by the
>>>> user.
>>>> Possible solutions:
>>>> I) Status quo.  We could fix it so that group is allowed to be  
>>>> assigned
>>>> as an alias in AS.
>>>> Pros:  Backward compatibility
>>>> Cons: a) will make the parser more complicated
>>>>     b) see 1) above.
>>>> II) Don't give an implicit alias to the group key(s).  If users  
>>>> want an
>>>> alias, they can assign it using AS.
>>>> Pros:  Simplicity
>>>> Cons:  We do assign aliases to grouped bags.  That is, if we  
>>>> have C =
>>>> GROUP B by $0 the resulting schema of C is (group, B).  So if we  
>>>> don't
>>>> assign an alias to the group key, we now have a schema ($0, B).   
>>>> This seems
>>>> strange.  And worse yet, if users want to alias the group key(s),  
>>>> they'll be
>>>> forced to alias all the grouped bags as well.
>>>> III) Carry the alias (if any) that the field had before.  So if  
>>>> we had a
>>>> script like:
>>>> A = load 'myfile' as (x, y, z);
>>>> B = group A by x;
>>>> Then the schema of B would be (x, A).  This is quite natural for  
>>>> grouping
>>>> of single columns.  But it turns nasty when you group on  
>>>> multiple columns.
>>>>  Do we then append the names together?  So if you have
>>>> B = group A by x, y;
>>>> is the resulting schema (x_y, A)?  Ugh.
>>>> In this case there is also the question of what to do in the  
>>>> case of
>>>> cogroups, where the key may be named differently in different  
>>>> relations.
>>>> A = load 'myfile' as (x, y, z);
>>>> B = load 'myotherfile' as (t, u, v);
>>>> C = cogroup A by x, B by t;
>>>> Is the resulting schema (x, A, B) or (t, A, B) or are both  
>>>> valid?  This
>>>> could be resolved by either saying first one always wins, or  
>>>> allowing
>>>> either.
>>>> Pros:  Very natural for the users, their fields maintain names  
>>>> through
>>>> the query.
>>>> Cons:  Quickly gets burdensome in the case of multi-key groups.
>>>> IV) Assign a non-keyword alias to the group key, like grp or  
>>>> groupkey or
>>>> grpkey (or some other suitable choice).
>>>> Pros:  Least disruptive change.  Users only have to go through  
>>>> their
>>>> scripts and find places where they use the group alias and  
>>>> change it to grp
>>>> (or whatever).
>>>> Cons:  Still leaves us with a situation where we are assigning a  
>>>> name to
>>>> a field arbitrarily, leaving users confused as to how their  
>>>> fields got named
>>>> that.
>>>> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
>>>> relation anyway.
>>>> Pros:  Smaller syntax in a language is always good.
>>>> Cons:  Will break a lot of scripts, and confuse a lot of users  
>>>> who only
>>>> think in terms of GROUP and JOIN and never use COGROUP explicitly.
>>>> One could also conceive of combinations of these.  For example,  
>>>> we always
>>>> assign a name like grpkey to the group key(s), and in the single  
>>>> key case we
>>>> also carry forward the alias that the field already had, if any.
>>>> Thoughts?  Other possibilities?
>>>> Alan.
>>> --
>>> Christopher Olston, Ph.D.
>>> Sr. Research Scientist
>>> Yahoo! Research

Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
