From: Benjamin Reed
To: pig-dev@incubator.apache.org
Cc: Alan Gates
Subject: Re: Issues with group as an alias
Date: Thu, 5 Jun 2008 09:24:28 -0700
In-Reply-To: <48481005.7060706@yahoo-inc.com>
Message-Id: <200806050924.28800.breed@yahoo-inc.com>

I prefer option I). I agree that "group" may not be the optimal choice for
the alias name, but it's not that bad and I would opt for being less
disruptive.

As far as the parser change, I think it is just changing:

)+(|||"::")*>

to

)+(|||"::")*|>

It might be good to open up a Jira so that we can actually record votes.

ben

On Thursday 05 June 2008 09:10:45 Alan Gates wrote:
> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the field
> (or set of fields) that are grouped on are given the alias 'group'.
> This has a couple of issues:
>
> 1) It's confusing. 'group' is now a keyword and an alias.
> 2) We don't currently allow 'group' as an alias in an AS. It is
> strange to have an alias that can only be assigned by the language and
> never by the user.
>
> Possible solutions:
>
> I) Status quo. We could fix it so that group is allowed to be assigned
> as an alias in AS.
>
> Pros: Backward compatibility
> Cons: a) will make the parser more complicated
>       b) see 1) above.
>
>
> II) Don't give an implicit alias to the group key(s). If users want an
> alias, they can assign it using AS.
>
> Pros: Simplicity
> Cons: We do assign aliases to grouped bags. That is, if we have C =
> GROUP B by $0 the resulting schema of C is (group, B). So if we don't
> assign an alias to the group key, we now have a schema ($0, B). This
> seems strange. And worse yet, if users want to alias the group key(s),
> they'll be forced to alias all the grouped bags as well.
>
> III) Carry the alias (if any) that the field had before. So if we had a
> script like:
>
> A = load 'myfile' as (x, y, z);
> B = group A by x;
>
> Then the schema of B would be (x, A). This is quite natural for grouping
> of single columns.
> But it turns nasty when you group on multiple
> columns. Do we then append the names together? So if you have
>
> B = group A by x, y;
>
> is the resulting schema (x_y, A)? Ugh.
>
> In this case there is also the question of what to do in the case of
> cogroups, where the key may be named differently in different relations.
>
> A = load 'myfile' as (x, y, z);
> B = load 'myotherfile' as (t, u, v);
> C = cogroup A by x, B by t;
>
> Is the resulting schema (x, A, B) or (t, A, B) or are both valid? This
> could be resolved by either saying the first one always wins, or allowing
> either.
>
> Pros: Very natural for the users, their fields maintain names through
> the query.
> Cons: Quickly gets burdensome in the case of multi-key groups.
>
> IV) Assign a non-keyword alias to the group key, like grp or groupkey or
> grpkey (or some other suitable choice).
>
> Pros: Least disruptive change. Users only have to go through their
> scripts and find places where they use the group alias and change it to
> grp (or whatever).
> Cons: Still leaves us with a situation where we are assigning a name to
> a field arbitrarily, leaving users confused as to how their fields got
> named that.
>
> V) Remove GROUP as a keyword. It is just short for COGROUP of one
> relation anyway.
>
> Pros: Smaller syntax in a language is always good.
> Cons: Will break a lot of scripts, and confuse a lot of users who only
> think in terms of GROUP and JOIN and never use COGROUP explicitly.
>
> One could also conceive of combinations of these. For example, we
> always assign a name like grpkey to the group key(s), and in the single
> key case we also carry forward the alias that the field already had, if
> any.
>
> Thoughts? Other possibilities?
>
> Alan.
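
To make the naming options in the quoted mail easier to compare, here is a
rough Python model of the schema each one would give a (CO)GROUP result. It
is only a sketch and not Pig code: grouped_schema and its argument layout are
made up for illustration, option III is assumed to resolve cogroups by letting
the first relation win and to join multi-key names with "_" (the x_y idea
floated above), and "grpkey" stands in for whichever alias option IV would
pick. Option V does not change the schema, so it is left out.

```python
from typing import List, Sequence, Tuple

# Hypothetical stand-in for one "A by x, y" clause of a (CO)GROUP statement:
# (relation alias, aliases of the key fields in that relation).
GroupInput = Tuple[str, Sequence[str]]


def grouped_schema(inputs: Sequence[GroupInput], option: str) -> List[str]:
    """Return the field aliases of a (CO)GROUP result under one option."""
    if not inputs:
        raise ValueError("need at least one relation to group")

    # Every option keeps one bag per input relation, named after the relation.
    bags = [relation for relation, _ in inputs]

    if option == "I":
        key = "group"    # status quo: the key is always called 'group'
    elif option == "II":
        key = "$0"       # no implicit alias, only the positional reference
    elif option == "III":
        # Assumption: the first relation wins, multi-key names joined by "_".
        key = "_".join(inputs[0][1])
    elif option == "IV":
        key = "grpkey"   # assumed non-keyword alias
    else:
        raise ValueError("unknown option: " + option)

    return [key] + bags


if __name__ == "__main__":
    print(grouped_schema([("A", ["x"])], "I"))                 # B = group A by x;
    print(grouped_schema([("A", ["x", "y"])], "III"))          # B = group A by x, y;
    print(grouped_schema([("A", ["x"]), ("B", ["t"])], "III")) # C = cogroup A by x, B by t;
```

Under these assumptions, the single-key case of option III gives (x, A), the
two-key case gives (x_y, A), and the cogroup gives (x, A, B), which matches
the schemas discussed in the mail.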