hadoop-pig-dev mailing list archives

From Benjamin Reed <br...@yahoo-inc.com>
Subject Re: Issues with group as an alias
Date Tue, 17 Jun 2008 14:30:58 GMT
I agree with Pi. +1 for (1).

ben

On Tuesday 17 June 2008 03:41:13 pi song wrote:
> If it's confusing because our model is different, people just have to
> learn. If it's confusing because it is misleading, it has to be fixed.
>
> As far as we can explain "why" logically, I think it should be ok.
> I vote (1) for this.
>
> On Tue, Jun 17, 2008 at 8:03 AM, Chris Olston <olston@yahoo-inc.com> wrote:
> > Oh -- sorry I misunderstood.
> >
> > That's a valid question and now is the right time to revisit it. Does
> > anybody see any natural naming convention *other than* naming them after
> > the input tables (pig's current practice)? If so, let's discuss. If not,
> > it seems the only two choices are: (1) leave it as-is, or (2) do not
> > assign any name, and force user to use "AS" (this is what Jaql does I
> > believe).
> >
> > -Chris
> >
> >
> > On Jun 16, 2008, at 1:29 PM, Olga Natkovich wrote:
> >
> >> Chris,
> >>
> >> What I meant to ask was what do we do with the rest of the fields in the
> >> group tuples. Currently, we name those fields with the names of the
> >> corresponding tables. I was asking if we want to continue that. I know
> >> that people find it confusing to see fields named after relations.
> >>
> >> Olga
> >>
> >>  -----Original Message-----
> >>
> >>> From: Chris Olston [mailto:olston@yahoo-inc.com]
> >>> Sent: Monday, June 16, 2008 12:54 PM
> >>> To: pig-dev@incubator.apache.org
> >>> Subject: Re: Issues with group as an alias
> >>>
> >>> Olga,
> >>>
> >>> The idea is that when there is just one field with one name,
> >>> we use that name for the group key. In all other cases we do
> >>> *not* supply an automatic name (the user can assign their own
> >>> name using "as").
> >>>
> >>> I believe this solution: (1) is very simple and unambiguous,
> >>> and (2) makes common cases very natural (e.g., BAR = group FOO
> >>> by URL; foreach BAR generate URL, ...).
> >>>
> >>> -Chris
> >>>
> >>> On Jun 16, 2008, at 12:48 PM, Olga Natkovich wrote:
> >>>
> >>>> What about naming the rest of the fields in the group? Do we want to
> >>>> continue naming them with the names of the corresponding tables? I
> >>>> think users find that confusing as well.
> >>>>
> >>>> Olga
> >>>>
> >>>>  -----Original Message-----
> >>>>
> >>>>> From: Alan Gates [mailto:gates@yahoo-inc.com]
> >>>>> Sent: Monday, June 16, 2008 11:32 AM
> >>>>> To: pig-dev@incubator.apache.org
> >>>>> Subject: Re: Issues with group as an alias
> >>>>>
> >>>>> I would like to propose a slight modification:
> >>>>>
> >>>>> I think that we should continue to support 'group' as the alias name
> >>>>> for some transition period (3 or maybe 6 months).
> >>>>> We can remove all references to group as an alias from the
> >>>>> documentation and print a warning when users use it.  But I don't
> >>>>> think we should drop it immediately, as we'll break many scripts.
> >>>>>
> >>>>> Other than that I'm fine with the proposal.
> >>>>>
> >>>>> Alan.
> >>>>>
> >>>>> Chris Olston wrote:
> >>>>>> No.
> >>>>>>
> >>>>>> The standing proposal for Option III is:
> >>>>>>
> >>>>>> 1. If you are (CO)Grouping on a *single* field AND in the case of
> >>>>>> co-group all field names are the same (e.g., cogroup A by url, B by
> >>>>>> url), then give the group key that name (e.g., "url").
> >>>>>>
> >>>>>> 2. Else, do *not* automatically assign any name. The user can refer to
> >>>>>> it as $0 and/or use "AS" to give it a name manually.
> >>>>>>
> >>>>>> (To be clear, even in case #1, the user has the option to override the
> >>>>>> automatically-assigned name using "AS" if s/he chooses.)
> >>>>>>
> >>>>>> -Chris
> >>>>>>
> >>>>>>
> >>>>>> On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:
> >>>>>>
> >>>>>>> I completely agree. It does start getting confusing. Especially if we
> >>>>>>> try to deal with multi field keys.
> >>>>>>>
> >>>>>>> A = load 'somefile1' USING PigStorage() AS (B, C, Z);
> >>>>>>> B = load 'somefile2' USING PigStorage() AS (A, C, Y);
> >>>>>>> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>>>>>>
> >>>>>>> G1 = COGROUP A by (B,C), B by (A, C);
> >>>>>>> G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);
> >>>>>>>
> >>>>>>> What is the schema for G2?
> >>>>>>>
> >>>>>>> ben
> >>>>>>>
> >>>>>>> On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
> >>>>>>>> So what is the conclusion here ?
> >>>>>>>>
> >>>>>>>> group key alias == the first variables group by field ?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> What happens in a case like this then :
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> A = load 'somefile1' USING PigStorage() AS (B, C);
> >>>>>>>> B = load 'somefile2' USING PigStorage() AS (A, C);
> >>>>>>>> C = load 'somefile3' USING PigStorage() AS (A, B);
> >>>>>>>>
> >>>>>>>> G1 = COGROUP A by B, B by A;
> >>>>>>>> G2 = COGROUP A by C, C by A;
> >>>>>>>> ...
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>> A slightly contrived example for sure, but imo grammar has to be as
> >>>>>>>> clearly specified as possible.
> >>>>>>>>
> >>>>>>>> A reserved keyword as group alias implies we don't hit this problem
> >>>>>>>> (group or groupkey or grpkey)... and also the fact that we are
> >>>>>>>> backwardly compatible.
> >>>>>>>>
> >>>>>>>> [I never liked inferred schema prefix section in the schemas doc
> >>>>>>>> (which is applied selectively) - makes it extremely tough to
> >>>>>>>> generate pig scripts]
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Mridul
> >>>>>>>>
> >>>>>>>> Alan Gates wrote:
> >>>>>>>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the
> >>>>>>>>> field (or set of fields) that are grouped on are given the alias
> >>>>>>>>> 'group'.
> >>>>>>>>>
> >>>>>>>>> This has a couple of issues:
> >>>>>>>>>
> >>>>>>>>> 1)  It's confusing.  'group' is now a keyword and an alias.
> >>>>>>>>> 2)  We don't currently allow 'group' as an alias in an AS.  It is
> >>>>>>>>> strange to have an alias that can only be assigned by the language
> >>>>>>>>> and never by the user.
> >>>>>>>>>
> >>>>>>>>> Possible solutions:
> >>>>>>>>>
> >>>>>>>>> I) Status quo.  We could fix it so that group is allowed to be
> >>>>>>>>> assigned as an alias in AS.
> >>>>>>>>>
> >>>>>>>>> Pros:  Backward compatibility
> >>>>>>>>> Cons: a) will make the parser more complicated
> >>>>>>>>>     b) see 1) above.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> II) Don't give an implicit alias to the group key(s).  If users
> >>>>>>>>> want an alias, they can assign it using AS.
> >>>>>>>>>
> >>>>>>>>> Pros:  Simplicity
> >>>>>>>>> Cons:  We do assign aliases to grouped bags.  That is, if we have C
> >>>>>>>>> = GROUP B by $0 the resulting schema of C is (group, B).  So if we
> >>>>>>>>> don't assign an alias to the group key, we now have a schema ($0,
> >>>>>>>>> B).  This seems strange.  And worse yet, if users want to alias the
> >>>>>>>>> group key(s), they'll be forced to alias all the grouped bags as
> >>>>>>>>> well.
> >>>>>>>>>
> >>>>>>>>> III) Carry the alias (if any) that the field had before.  So if we
> >>>>>>>>> had a script like:
> >>>>>>>>> A = load 'myfile' as (x, y, z);
> >>>>>>>>> B = group A by x;
> >>>>>>>>>
> >>>>>>>>> Then the schema of B would be (x, A).  This is quite natural for
> >>>>>>>>> grouping of single columns.  But it turns nasty when you group on
> >>>>>>>>> multiple columns.  Do we then append the names together?  So if
> >>>>>>>>> you have
> >>>>>>>>>
> >>>>>>>>> B = group A by (x, y);
> >>>>>>>>>
> >>>>>>>>> is the resulting schema (x_y, A)?  Ugh.
> >>>>>>>>>
> >>>>>>>>> In this case there is also the question of what to do in the case
> >>>>>>>>> of cogroups, where the key may be named differently in different
> >>>>>>>>> relations.
> >>>>>>>>>
> >>>>>>>>> A = load 'myfile' as (x, y, z);
> >>>>>>>>> B = load 'myotherfile' as (t, u, v);
> >>>>>>>>> C = cogroup A by x, B by t;
> >>>>>>>>>
> >>>>>>>>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?
> >>>>>>>>> This could be resolved by either saying first one always wins, or
> >>>>>>>>> allowing either.
> >>>>>>>>>
> >>>>>>>>> Pros:  Very natural for the users, their fields maintain names
> >>>>>>>>> through the query.
> >>>>>>>>> Cons:  Quickly gets burdensome in the case of multi-key groups.
> >>>>>>>>>
> >>>>>>>>> IV) Assign a non-keyword alias to the group key, like grp or
> >>>>>>>>> groupkey or grpkey (or some other suitable choice).
> >>>>>>>>> Pros:  Least disruptive change.  Users only have to go through
> >>>>>>>>> their scripts and find places where they use the group alias and
> >>>>>>>>> change it to grp (or whatever).
> >>>>>>>>> Cons:  Still leaves us with a situation where we are assigning a
> >>>>>>>>> name to a field arbitrarily, leaving users confused as to how their
> >>>>>>>>> fields got named that.
> >>>>>>>>>
> >>>>>>>>> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
> >>>>>>>>> relation anyway.
> >>>>>>>>>
> >>>>>>>>> Pros:  Smaller syntax in a language is always good.
> >>>>>>>>> Cons:  Will break a lot of scripts, and confuse a lot of users who
> >>>>>>>>> only think in terms of GROUP and JOIN and never use COGROUP
> >>>>>>>>> explicitly.
> >>>>>>>>>
> >>>>>>>>> One could also conceive of combinations of these.  For example, we
> >>>>>>>>> always assign a name like grpkey to the group key(s), and in the
> >>>>>>>>> single key case we also carry forward the alias that the field
> >>>>>>>>> already had, if any.
> >>>>>>>>>
> >>>>>>>>> Thoughts?  Other possibilities?
> >>>>>>>>>
> >>>>>>>>> Alan.
> >>>>>>
> >>>>>> --
> >>>>>> Christopher Olston, Ph.D.
> >>>>>> Sr. Research Scientist
> >>>>>> Yahoo! Research
> >>>
> >>> --
> >>> Christopher Olston, Ph.D.
> >>> Sr. Research Scientist
> >>> Yahoo! Research
> >
> > --
> > Christopher Olston, Ph.D.
> > Sr. Research Scientist
> > Yahoo! Research
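
For readers skimming the thread, the naming rule in Chris's "Option III" proposal quoted above can be modelled in a few lines of Python. This is only a sketch of the rule as stated in the thread, not Pig's implementation: the function name group_key_alias, its parameters, and the use of None for an unnamed field such as $0 are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def group_key_alias(
    keys_per_input: Sequence[Sequence[Optional[str]]],
    user_alias: Optional[str] = None,
) -> Optional[str]:
    """Model of the proposed rule for naming the (CO)GROUP key.

    keys_per_input: for each input relation, the field names it is grouped
        by; None stands for a field with no name (e.g. $0).
    user_alias: a name given explicitly with AS, which always wins.
    Returns the alias, or None when no name would be assigned.
    """
    # An explicit AS overrides any automatic name (the parenthetical remark).
    if user_alias is not None:
        return user_alias
    # Rule 1 applies only when every input is grouped on a single field.
    if not keys_per_input or any(len(keys) != 1 for keys in keys_per_input):
        return None
    names = {keys[0] for keys in keys_per_input}
    # ...and, for a cogroup, only when all inputs use the same field name.
    if len(names) == 1:
        (name,) = names
        return name  # still None if that single field was unnamed
    # Rule 2: otherwise no automatic name; refer to the key as $0 or use AS.
    return None


if __name__ == "__main__":
    print(group_key_alias([["x"]]))             # group A by x
    print(group_key_alias([["url"], ["url"]]))  # cogroup A by url, B by url
    print(group_key_alias([["x"], ["t"]]))      # cogroup A by x, B by t
    print(group_key_alias([["x", "y"]]))        # group A by (x, y)
```

Under this reading, the multi-field and mismatched-name cases raised by Ben and Mridul simply get no automatic name, which is why the proposal avoids the x_y question.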


