hadoop-pig-dev mailing list archives

From "pi song" <pi.so...@gmail.com>
Subject Re: Issues with group as an alias
Date Thu, 05 Jun 2008 23:33:09 GMT
I know it is very subjective, but I don't agree with "1)  It's confusing".
On the developers' side it is, but on the users' side it might not be.

Some languages allow usage of keywords given they are used in the right
context. The current Pig implementation also allows referring to "group" as
an alias.

Before we jump to a solution, shouldn't we first make our position clear
on "Do we want every keyword to be a reserved word regardless of
context?"
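
To make the "right context" idea concrete, here is a minimal sketch of a context-sensitive classifier. It is a toy in Python, not Pig's actual parser: the `classify` function, the `KEYWORDS` set, and the rule "`group` is an operator only directly after `=`" are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical keyword set for this toy; not Pig's real keyword list.
KEYWORDS = {"group", "by", "foreach", "generate"}


def classify(statement: str) -> List[Tuple[str, str]]:
    """Label each token of a statement as 'keyword', 'alias', or 'symbol'."""
    tokens = re.findall(r"[A-Za-z_]\w*|\S", statement)
    result = []
    prev = None
    for tok in tokens:
        if not re.match(r"[A-Za-z_]", tok):
            kind = "symbol"
        elif tok.lower() == "group":
            # Assumed rule: 'group' is the operator only right after '=';
            # anywhere else it refers to the group-key alias.
            kind = "keyword" if prev == "=" else "alias"
        elif tok.lower() in KEYWORDS:
            kind = "keyword"
        else:
            kind = "alias"
        result.append((tok, kind))
        prev = tok
    return result


if __name__ == "__main__":
    print(classify("C = group B by x;"))
    print(classify("D = foreach C generate group, B;"))
```

In the first statement `group` comes out as a keyword, in the second as an alias, which is the distinction a "reserved regardless of context" policy would rule out.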


On 6/6/08, Chris Olston <olston@yahoo-inc.com> wrote:
> I vote for (III) -- propagate the alias. This makes the scripts very
> natural and readable, e.g.:
> a = group pages by host;
> b = foreach a generate host, count(pages);
> As for what to do in the case of grouping on multiple fields, or co-group
> on differently-named fields, we should *not* assign a default name -- the
> user can choose a name using "AS".
> -Chris
> On Jun 5, 2008, at 9:10 AM, Alan Gates wrote:
>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field
>> (or set of fields) that are grouped on are given the alias 'group'.  This
>> has a couple of issues:
>> 1)  It's confusing.  'group' is now a keyword and an alias.
>> 2)  We don't currently allow 'group' as an alias in an AS.  It is strange
>> to have an alias that can only be assigned by the language and never by the
>> user.
>> Possible solutions:
>> I) Status quo.  We could fix it so that group is allowed to be assigned as
>> an alias in AS.
>> Pros:  Backward compatibility
>> Cons: a) will make the parser more complicated
>>     b) see 1) above.
>> II) Don't give an implicit alias to the group key(s).  If users want an
>> alias, they can assign it using AS.
>> Pros:  Simplicity
>> Cons:  We do assign aliases to grouped bags.  That is, if we have C =
>> GROUP B by $0 the resulting schema of C is (group, B).  So if we don't
>> assign an alias to the group key, we now have a schema ($0, B).  This seems
>> strange.  And worse yet, if users want to alias the group key(s), they'll be
>> forced to alias all the grouped bags as well.
>> III) Carry the alias (if any) that the field had before.  So if we had a
>> script like:
>> A = load 'myfile' as (x, y, z);
>> B = group A by x;
>> Then the schema of B would be (x, A).  This is quite natural for grouping
>> of single columns.  But it turns nasty when you group on multiple columns.
>>  Do we then append the names together?  So if you have
>> B = group A by x, y;
>> is the resulting schema (x_y, A)?  Ugh.
>> In this case there is also the question of what to do in the case of
>> cogroups, where the key may be named differently in different relations.
>> A = load 'myfile' as (x, y, z);
>> B = load 'myotherfile' as (t, u, v);
>> C = cogroup A by x, B by t;
>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?  This
>> could be resolved by either saying first one always wins, or allowing
>> either.
>> Pros:  Very natural for the users, their fields maintain names through the
>> query.
>> Cons:  Quickly gets burdensome in the case of multi-key groups.
>> IV) Assign a non-keyword alias to the group key, like grp or groupkey or
>> grpkey (or some other suitable choice).
>> Pros:  Least disruptive change.  Users only have to go through their
>> scripts and find places where they use the group alias and change it to grp
>> (or whatever).
>> Cons:  Still leaves us with a situation where we are assigning a name to a
>> field arbitrarily, leaving users confused as to how their fields got named
>> that.
>> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
>> relation anyway.
>> Pros:  Smaller syntax in a language is always good.
>> Cons:  Will break a lot of scripts, and confuse a lot of users who only
>> think in terms of GROUP and JOIN and never use COGROUP explicitly.
>> One could also conceive of combinations of these.  For example, we always
>> assign a name like grpkey to the group key(s), and in the single key case we
>> also carry forward the alias that the field already had, if any.
>> Thoughts?  Other possibilities?
>> Alan.
> --
> Christopher Olston, Ph.D.
> Sr. Research Scientist
> Yahoo! Research
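
For reference, the naming proposals quoted above can be compared side by side with a small sketch. This is only a Python model of the resulting schemas, not Pig code: `group_schema` is a hypothetical helper, `grpkey` is just one of the names floated under (IV), and for (III) it follows Chris's variant (no default name for multi-key groups, so the key stays positional unless the user supplies one with AS).

```python
from typing import List, Optional


def group_schema(option: str, relation: str, keys: List[str],
                 as_name: Optional[str] = None) -> List[str]:
    """Field names of `group <relation> by <keys>` under one proposal."""
    if as_name is not None:
        # A user-supplied AS name wins under every proposal in this model.
        key_name = as_name
    elif option == "I":
        key_name = "group"      # status quo: implicit alias 'group'
    elif option == "II":
        key_name = "$0"         # no implicit alias, positional only
    elif option == "III":
        # Carry the alias for a single key; assumed: no default otherwise.
        key_name = keys[0] if len(keys) == 1 else "$0"
    elif option == "IV":
        key_name = "grpkey"     # assumed choice among grp/groupkey/grpkey
    else:
        raise ValueError(f"unknown option: {option}")
    return [key_name, relation]


if __name__ == "__main__":
    for opt in ("I", "II", "III", "IV"):
        print(opt, group_schema(opt, "A", ["x"]),
              group_schema(opt, "A", ["x", "y"]))
```

Option (V) is left out because dropping the GROUP keyword does not by itself change how the key is named. Cogroups on differently named keys are also not modelled, since the thread leaves that case open.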
