hadoop-pig-dev mailing list archives

From "Ted Dunning" <ted.dunn...@gmail.com>
Subject Re: Issues with group as an alias
Date Sat, 14 Jun 2008 05:24:01 GMT
I think that I am convinced III is best.

On Fri, Jun 13, 2008 at 7:26 AM, Alan Gates <gates@yahoo-inc.com> wrote:

> All,
>
> I too will vote for III, with the caveat that we don't give names to
> multi-field grouping keys.  We need to make sure we support AS to allow the
> user to name their grouping keys if they want.
>
> So far, the vote totals are:
> I: 1
> II: 0
> III: 3
> IV: 0
> V: 0
>
> I'd like to make a decision and move forward by mid next week.  If you
> haven't voted and you'd like to, please do so now.  If you feel passionately
> about one of the options that is losing, please make your arguments now.
>
> Alan.
>
> Alan Gates wrote:
>
>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field
>> (or set of fields) that are grouped on are given the alias 'group'.  This
>> has a couple of issues:
>>
>> 1)  It's confusing.  'group' is now a keyword and an alias.
>> 2)  We don't currently allow 'group' as an alias in an AS.  It is strange
>> to have an alias that can only be assigned by the language and never by the
>> user.
>>
>> Possible solutions:
>>
>> I) Status quo.  We could fix it so that group is allowed to be assigned as
>> an alias in AS.
>>
>> Pros:  Backward compatibility
>> Cons: a) will make the parser more complicated
>>     b) see 1) above.
>>
>>
>> II) Don't give an implicit alias to the group key(s).  If users want an
>> alias, they can assign it using AS.
>>
>> Pros:  Simplicity
>> Cons:  We do assign aliases to grouped bags.  That is, if we have C =
>> GROUP B by $0 the resulting schema of C is (group, B).  So if we don't
>> assign an alias to the group key, we now have a schema ($0, B).  This seems
>> strange.  And worse yet, if users want to alias the group key(s), they'll be
>> forced to alias all the grouped bags as well.
>>
>> III) Carry the alias (if any) that the field had before.  So if we had a
>> script like:
>>
>> A = load 'myfile' as (x, y, z);
>> B = group A by x;
>>
>> Then the schema of B would be (x, A).  This is quite natural for grouping
>> of single columns.  But it turns nasty when you group on multiple columns.
>>  Do we then append the names together?  So if you have
>>
>> B = group A by x, y;
>>
>> is the resulting schema (x_y, A)?  Ugh.
>>
>> In this case there is also the question of what to do in the case of
>> cogroups, where the key may be named differently in different relations.
>>
>> A = load 'myfile' as (x, y, z);
>> B = load 'myotherfile' as (t, u, v);
>> C = cogroup A by x, B by t;
>>
>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?  This
>> could be resolved by either saying first one always wins, or allowing
>> either.
>>
>> Pros:  Very natural for the users, their fields maintain names through the
>> query.
>> Cons:  Quickly gets burdensome in the case of multi-key groups.
>>
>> IV) Assign a non-keyword alias to the group key, like grp or groupkey or
>> grpkey (or some other suitable choice).
>> Pros:  Least disruptive change.  Users only have to go through their
>> scripts and find places where they use the group alias and change it to grp
>> (or whatever).
>> Cons:  Still leaves us with a situation where we are assigning a name to a
>> field arbitrarily, leaving users confused as to how their fields got named
>> that.
>>
>> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
>> relation anyway.
>>
>> Pros:  Smaller syntax in a language is always good.
>> Cons:  Will break a lot of scripts, and confuse a lot of users who only
>> think in terms of GROUP and JOIN and never use COGROUP explicitly.
>>
>> One could also conceive of combinations of these.  For example, we always
>> assign a name like grpkey to the group key(s), and in the single key case we
>> also carry forward the alias that the field already had, if any.
>>
>> Thoughts?  Other possibilities?
>>
>> Alan.
>>
>
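
To make option III concrete, here is a rough sketch of the naming rule in Python. It is only a model of the proposal, not Pig code: the function names `group_key_alias` and `grouped_schema` are made up for illustration, "first one wins" is assumed for cogroups (the thread leaves that open), and an unnamed key is shown positionally as $0.

```python
from typing import List, Optional, Sequence, Tuple

# One input to a (CO)GROUP: the relation's alias and the aliases of its
# grouping keys. A key with no alias (e.g. "by $0") is represented as None.
GroupInput = Tuple[str, Sequence[Optional[str]]]


def group_key_alias(inputs: Sequence[GroupInput]) -> Optional[str]:
    """Return the alias the group key would carry under option III.

    Assumptions (not settled in the thread):
      * multi-field grouping keys get no name at all;
      * for cogroups, the first relation's key alias wins.
    """
    _, first_keys = inputs[0]
    if len(first_keys) != 1:
        return None  # multi-field key: leave it unnamed
    return first_keys[0]  # may itself be None if the field had no alias


def grouped_schema(inputs: Sequence[GroupInput]) -> List[str]:
    """Return the field names of the grouped relation: the key, then one bag per input."""
    alias = group_key_alias(inputs)
    key_name = alias if alias is not None else "$0"
    return [key_name] + [relation for relation, _ in inputs]


if __name__ == "__main__":
    # B = group A by x;
    print(grouped_schema([("A", ["x"])]))
    # B = group A by x, y;
    print(grouped_schema([("A", ["x", "y"])]))
    # C = cogroup A by x, B by t;
    print(grouped_schema([("A", ["x"]), ("B", ["t"])]))
```

The unnamed multi-key case is exactly where an explicit AS would come in, so the user can name the key if they want one.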


-- 
ted
