hadoop-pig-dev mailing list archives

From "Olga Natkovich" <ol...@yahoo-inc.com>
Subject RE: Issues with group as an alias
Date Fri, 13 Jun 2008 17:41:09 GMT
I am fine with 3. Can we in addition allow access by position as well as
let users assign their own names?

Olga 

> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com] 
> Sent: Friday, June 13, 2008 7:26 AM
> To: pig-dev@incubator.apache.org
> Subject: Re: Issues with group as an alias
> 
> All,
> 
> I too will vote for III, with the caveat that we don't give 
> names to multi-field grouping keys.  We need to make sure we 
> support AS to allow the user to name their grouping keys if they want.
> 
> So far, the vote totals are:
> I: 1
> II: 0
> III: 3
> IV: 0
> V: 0
> 
> I'd like to make a decision and move forward by mid next 
> week.  If you haven't voted and you'd like to, please do so 
> now.  If you feel passionately about one of the options that 
> is losing, please make your arguments now.
> 
> Alan.
> 
> Alan Gates wrote:
> > Currently in Pig Latin, anytime a (CO)GROUP statement is used, the 
> > field (or set of fields) that are grouped on are given the alias 
> > 'group'.  This has a couple of issues:
> >
> > 1)  It's confusing.  'group' is now a keyword and an alias.
> > 2)  We don't currently allow 'group' as an alias in an AS.  It is
> > strange to have an alias that can only be assigned by the language
> > and never by the user.
> >
> > Possible solutions:
> >
> > I) Status quo.  We could fix it so that group is allowed to be 
> > assigned as an alias in AS.
> >
> > Pros:  Backward compatibility
> > Cons: a) will make the parser more complicated
> >      b) see 1) above.
> >
> >
> > II) Don't give an implicit alias to the group key(s).  If users want
> > an alias, they can assign it using AS.
> >
> > Pros:  Simplicity
> > Cons:  We do assign aliases to grouped bags.  That is, if we have
> > C = GROUP B by $0 the resulting schema of C is (group, B).  So if we
> > don't assign an alias to the group key, we now have a schema ($0, B).
> > This seems strange.  And worse yet, if users want to alias the group
> > key(s), they'll be forced to alias all the grouped bags as well.
> >
> > III) Carry the alias (if any) that the field had before.  So if we
> > had a script like:
> >
> > A = load 'myfile' as (x, y, z);
> > B = group A by x;
> >
> > Then the schema of B would be (x, A).  This is quite natural for
> > grouping of single columns.  But it turns nasty when you group on
> > multiple columns.  Do we then append the names together?  So if you
> > have
> >
> > B = group A by x, y;
> >
> > is the resulting schema (x_y, A)?  Ugh.
> >
> > In this case there is also the question of what to do in the case of
> > cogroups, where the key may be named differently in different
> > relations.
> >
> > A = load 'myfile' as (x, y, z);
> > B = load 'myotherfile' as (t, u, v);
> > C = cogroup A by x, B by t;
> >
> > Is the resulting schema (x, A, B) or (t, A, B) or are both valid?  
> > This could be resolved by either saying first one always wins, or 
> > allowing either.
> >
> > Pros:  Very natural for the users, their fields maintain names
> > through the query.
> > Cons:  Quickly gets burdensome in the case of multi-key groups.
> >
> > IV) Assign a non-keyword alias to the group key, like grp or groupkey
> > or grpkey (or some other suitable choice).
> > Pros:  Least disruptive change.  Users only have to go through their
> > scripts and find places where they use the group alias and change it
> > to grp (or whatever).
> > Cons:  Still leaves us with a situation where we are assigning a name
> > to a field arbitrarily, leaving users confused as to how their fields
> > got named that.
> >
> > V) Remove GROUP as a keyword.  It is just short for COGROUP of one 
> > relation anyway.
> >
> > Pros:  Smaller syntax in a language is always good.
> > Cons:  Will break a lot of scripts, and confuse a lot of users who
> > only think in terms of GROUP and JOIN and never use COGROUP
> > explicitly.
> >
> > One could also conceive of combinations of these.  For example, we
> > always assign a name like grpkey to the group key(s), and in the
> > single key case we also carry forward the alias that the field
> > already had, if any.
> >
> > Thoughts?  Other possibilities?
> >
> > Alan.
> 
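For reference, the naming rules under discussion can be modeled in a few
lines of Python.  This is only a sketch of options I through IV as they
are described in the thread, not Pig code.  The function names
group_key_alias and group_schema, the use of "$0" to stand for an unnamed
key, the choice of "grpkey" for option IV, and the first-one-wins rule for
cogroups under option III are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def group_key_alias(option: str,
                    keys_per_relation: Sequence[Sequence[Optional[str]]]
                    ) -> Optional[str]:
    """Return the alias the group key would get, or None if it is unnamed."""
    if option == "I":
        # Status quo: the key is always called 'group'.
        return "group"
    if option == "II":
        # No implicit alias at all.
        return None
    if option == "III":
        # Carry the prior alias forward. Assumption: for a cogroup the
        # first relation's key name wins.
        first = keys_per_relation[0]
        if len(first) != 1:
            # Caveat from the thread: multi-field keys get no name.
            return None
        return first[0]  # may be None if the field never had an alias
    if option == "IV":
        # Assumption: 'grpkey' is picked from the candidate names.
        return "grpkey"
    raise ValueError("unknown option: " + option)


def group_schema(option: str,
                 relations: Sequence[Tuple[str, Sequence[Optional[str]]]]
                 ) -> List[str]:
    """Schema of a (CO)GROUP result: the key first, then one bag per relation."""
    alias = group_key_alias(option, [keys for _, keys in relations])
    key = alias if alias is not None else "$0"  # unnamed keys are positional
    return [key] + [name for name, _ in relations]


# B = group A by x;
print(group_schema("III", [("A", ["x"])]))
# B = group A by x, y;
print(group_schema("III", [("A", ["x", "y"])]))
# C = cogroup A by x, B by t;
print(group_schema("III", [("A", ["x"]), ("B", ["t"])]))
```

In this model an unnamed key is reachable only by position, which is where
a user-supplied AS name would come in.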
