Mailing-List: contact pig-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: pig-dev@incubator.apache.org
From: Benjamin Reed <breed@yahoo-inc.com>
To: pig-dev@incubator.apache.org
Subject: Re: Issues with group as an alias
Date: Thu, 5 Jun 2008 09:24:28 -0700
User-Agent: KMail/1.9.9
Cc: Alan Gates <gates@yahoo-inc.com>
References: <48481005.7060706@yahoo-inc.com>
In-Reply-To: <48481005.7060706@yahoo-inc.com>
Message-Id: <200806050924.28800.breed@yahoo-inc.com>

I prefer option I). I agree that "group" may not be the optimal choice for the
alias name, but it's not that bad, and I would opt for being less disruptive.
As for the parser change, I think it is just a matter of changing:

    <IDENTIFIER: (<LETTER>)+(<DIGIT>|<LETTER>|<SPECIALCHAR>|"::")*>
to 
    <IDENTIFIER: (<LETTER>)+(<DIGIT>|<LETTER>|<SPECIALCHAR>|"::")*|<GROUP>>
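
To make the intent of that token change concrete, here is a rough Python sketch of
a keyword-first lexer where an alias position optionally accepts the GROUP keyword.
This is not Pig's actual parser. The keyword subset, the case-insensitive keyword
match, and the reading of SPECIALCHAR as "_" are all assumptions made for
illustration, and token_kind/is_alias are hypothetical helper names.

```python
import re

# Assumptions: LETTER is [A-Za-z], DIGIT is [0-9], SPECIALCHAR is "_".
# Mirrors (<LETTER>)+(<DIGIT>|<LETTER>|<SPECIALCHAR>|"::")*
IDENTIFIER = re.compile(r"[A-Za-z]+(?:[A-Za-z0-9_]|::)*")

# Hypothetical subset of keywords, matched case-insensitively (assumption).
KEYWORDS = {"group", "cogroup", "by", "as", "load"}


def token_kind(word):
    """Classify a word the way a keyword-first lexer would."""
    if word.lower() in KEYWORDS:
        return word.upper()
    if IDENTIFIER.fullmatch(word):
        return "IDENTIFIER"
    return "UNKNOWN"


def is_alias(word, allow_group=False):
    """Return True if the word is acceptable where an alias is expected.

    With allow_group=True the GROUP keyword is also accepted, which is the
    effect option I) is after.
    """
    kind = token_kind(word)
    return kind == "IDENTIFIER" or (allow_group and kind == "GROUP")
```

With allow_group left at False, is_alias("group") is rejected because the keyword
wins over the identifier pattern; with allow_group=True it is accepted, while other
keywords such as "by" stay rejected.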

It might be good to open up a Jira so that we can actually record votes.

ben

On Thursday 05 June 2008 09:10:45 Alan Gates wrote:
> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field
> (or set of fields) that are grouped on are given the alias 'group'.
> This has a couple of issues:
>
> 1)  It's confusing.  'group' is now a keyword and an alias.
> 2)  We don't currently allow 'group' as an alias in an AS.  It is
> strange to have an alias that can only be assigned by the language and
> never by the user.
>
> Possible solutions:
>
> I) Status quo.  We could fix it so that group is allowed to be assigned
> as an alias in AS.
>
> Pros:  Backward compatibility
> Cons: a) will make the parser more complicated
>       b) see 1) above.
>
>
> II) Don't give an implicit alias to the group key(s).  If users want an
> alias, they can assign it using AS.
>
> Pros:  Simplicity
> Cons:  We do assign aliases to grouped bags.  That is, if we have C =
> GROUP B by $0 the resulting schema of C is (group, B).  So if we don't
> assign an alias to the group key, we now have a schema ($0, B).  This
> seems strange.  And worse yet, if users want to alias the group key(s),
> they'll be forced to alias all the grouped bags as well.
>
> III) Carry the alias (if any) that the field had before.  So if we had a
> script like:
>
> A = load 'myfile' as (x, y, z);
> B = group A by x;
>
> Then the schema of B would be (x, A).  This is quite natural for grouping
> of single columns.  But it turns nasty when you group on multiple
> columns.  Do we then append the names together?  So if you have
>
> B = group A by x, y;
>
> is the resulting schema (x_y, A)?  Ugh.
>
> In this case there is also the question of what to do in the case of
> cogroups, where the key may be named differently in different relations.
>
> A = load 'myfile' as (x, y, z);
> B = load 'myotherfile' as (t, u, v);
> C = cogroup A by x, B by t;
>
> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?  This
> could be resolved by either saying first one always wins, or allowing
> either.
>
> Pros:  Very natural for the users, their fields maintain names through
> the query.
> Cons:  Quickly gets burdensome in the case of multi-key groups.
>
> IV) Assign a non-keyword alias to the group key, like grp or groupkey or
> grpkey (or some other suitable choice).
>
> Pros:  Least disruptive change.  Users only have to go through their
> scripts and find places where they use the group alias and change it to
> grp (or whatever).
> Cons:  Still leaves us with a situation where we are assigning a name to
> a field arbitrarily, leaving users confused as to how their fields got
> named that.
>
> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
> relation anyway.
>
> Pros:  Smaller syntax in a language is always good.
> Cons:  Will break a lot of scripts, and confuse a lot of users who only
> think in terms of GROUP and JOIN and never use COGROUP explicitly.
>
> One could also conceive of combinations of these.  For example, we
> always assign a name like grpkey to the group key(s), and in the single
> key case we also carry forward the alias that the field already had, if
> any.
>
> Thoughts?  Other possibilities?
>
> Alan.
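
For comparison, the schemas that options I) through IV) would produce for a plain
GROUP can be sketched in a few lines of Python. This is only an illustration of the
proposals quoted above, not of any Pig implementation: the function names are
hypothetical, "grpkey" is just one of the candidate names floated for option IV),
and joining multi-key names with "_" under option III) is the open question from
the mail rather than a decided rule.

```python
def group_key_alias(key_aliases, option):
    """Return the alias the group key would get under each proposed option."""
    if option == "I":
        return "group"                 # status quo: keyword doubles as alias
    if option == "II":
        return "$0"                    # no implicit alias, positional only
    if option == "III":
        return "_".join(key_aliases)   # carry names forward (x, or x_y)
    if option == "IV":
        return "grpkey"                # one candidate non-keyword alias
    raise ValueError(f"unknown option: {option}")


def group_schema(relation, key_aliases, option):
    """Schema of `B = group <relation> by <keys>` as (key alias, bag alias)."""
    return (group_key_alias(key_aliases, option), relation)


if __name__ == "__main__":
    for opt in ("I", "II", "III", "IV"):
        print(opt, group_schema("A", ["x"], opt), group_schema("A", ["x", "y"], opt))
```

The cogroup case, where the key may be named differently in each relation, is left
out on purpose, since the mail does not settle which name would win.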