hadoop-pig-dev mailing list archives

From Chris Olston <ols...@yahoo-inc.com>
Subject Re: Issues with group as an alias
Date Mon, 23 Jun 2008 20:15:16 GMT
I vote for (2), because just about every new user gets tripped up by  
the implicit column names. Take for example:

A = load ...;
B = load ...;
C = cogroup A by url, B by url;
D = foreach C {
        A = order A by $0;
        generate flatten(A), flatten(B.$2);
    };

Users think that the "order A by $0" is referring to the *outer  
table* called A, not the A that is a column of C. There are two  
reasons for this confusion, I think:
1. Rather than writing C.A (absolute path), the user must write A  
(relative path) -- the prefix of the path is implicit.
2. The fact that C has a field called "A" is not evident from  
inspecting the script -- C's schema does not appear anywhere.
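
The shadowing described above can be modeled outside Pig. The Python sketch below is only an illustration and not how Pig's parser works: the `RELATIONS` table and the `resolve` helper are hypothetical, and the field lists for A and B are assumed because the loads in the script are elided.

```python
# Toy model of name lookup inside "foreach C { ... }".
# RELATIONS and resolve are hypothetical names used for illustration only.

# Top-level aliases and the field names of each relation.
# C's fields are implicit: the script never spells them out.
RELATIONS = {
    "A": ["url", "rank"],        # assumed fields; the original elides the load
    "B": ["url", "x", "hits"],   # assumed fields; the original elides the load
    "C": ["group", "A", "B"],    # implicit names under the current convention
}

def resolve(name, enclosing=None):
    """Describe what `name` refers to.

    Inside a nested block over `enclosing`, the fields of `enclosing`
    are searched first, so they shadow top-level aliases of the same name.
    """
    if enclosing is not None and name in RELATIONS[enclosing]:
        return f"field {enclosing}.{name}"
    if name in RELATIONS:
        return f"relation {name}"
    raise KeyError(name)

print(resolve("A"))                 # relation A
print(resolve("A", enclosing="C"))  # field C.A
```

The same token "A" yields two different answers depending on whether it appears inside the nested block, which is exactly the relative-path surprise in reason 1.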

Forcing the user to write:

C = cogroup A by url, B by url as (url, A, B);

(or whatever names they want to give the fields of C) would fix #2.
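
To make the difference between choices (1) and (2) concrete, here is a small Python model of how the fields of C would be named. It is a sketch under assumptions: the function `cogroup_schema` and its parameters are hypothetical, and it assumes that under (2) a cogroup without "as" leaves the fields reachable only by position ($0, $1, ...), which is one possible reading of the proposal.

```python
# Hypothetical model of the two naming choices for cogroup output.

def cogroup_schema(inputs, as_names=None, implicit_names=True):
    """Return the field names of a cogroup over `inputs`.

    inputs         -- aliases of the cogrouped relations, e.g. ["A", "B"]
    as_names       -- names supplied by the user with "as", or None
    implicit_names -- True models choice (1), False models choice (2)
    """
    width = 1 + len(inputs)  # the group key plus one bag per input
    if as_names is not None:
        if len(as_names) != width:
            raise ValueError("the as-clause must name every field")
        return list(as_names)
    if implicit_names:
        # Choice (1): key is called "group", bags are named after the inputs.
        return ["group"] + list(inputs)
    # Choice (2): nothing is named; fields are addressed by position only.
    return [f"${i}" for i in range(width)]

print(cogroup_schema(["A", "B"]))                              # ['group', 'A', 'B']
print(cogroup_schema(["A", "B"], implicit_names=False))        # ['$0', '$1', '$2']
print(cogroup_schema(["A", "B"], as_names=["url", "A", "B"]))  # ['url', 'A', 'B']
```

Under choice (2) the only way to get a field called "A" in C is to write it in the script, so the schema becomes visible at the point where it is created.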

(We may also want to think about #1, but that's a separate issue from  
the current line of discussion.)

-Chris


On Jun 17, 2008, at 7:30 AM, Benjamin Reed wrote:

> I agree with Pi. +1 for (1).
>
> ben
>
> On Tuesday 17 June 2008 03:41:13 pi song wrote:
>> If it's confusing because our model is different, people just have to
>> learn. If it's confusing because it is misleading, it has to be fixed.
>>
>> As far as we can explain "why" logically, I think it should be ok.
>> I vote (1) for this.
>>
>> On Tue, Jun 17, 2008 at 8:03 AM, Chris Olston <olston@yahoo-inc.com> wrote:
>>> Oh -- sorry I misunderstood.
>>>
>>> That's a valid question and now is the right time to revisit it. Does
>>> anybody see any natural naming convention *other than* naming them after
>>> the input tables (pig's current practice)? If so, let's discuss. If not,
>>> it seems the only two choices are: (1) leave it as-is, or (2) do not
>>> assign any name, and force user to use "AS" (this is what Jaql does I
>>> believe).
>>>
>>> -Chris
>>>
>>>
>>> On Jun 16, 2008, at 1:29 PM, Olga Natkovich wrote:
>>>
>>>> Chris,
>>>>
>>>> What I meant to ask was what do we do with the rest of the fields in the
>>>> group tuples. Currently, we name those fields with the names of the
>>>> corresponding tables. I was asking if we want to continue that. I know
>>>> that people find it confusing to see fields named after relations.
>>>>
>>>> Olga
>>>>
>>>>  -----Original Message-----
>>>>
>>>>> From: Chris Olston [mailto:olston@yahoo-inc.com]
>>>>> Sent: Monday, June 16, 2008 12:54 PM
>>>>> To: pig-dev@incubator.apache.org
>>>>> Subject: Re: Issues with group as an alias
>>>>>
>>>>> Olga,
>>>>>
>>>>> The idea is that when there is just one field with one name,
>>>>> we use that name for the group key. In all other cases we do
>>>>> *not* supply an automatic name (the user can assign their own
>>>>> name using "as").
>>>>>
>>>>> I believe this solution: (1) is very simple and unambiguous,
>>>>> and (2) makes common cases very natural (e.g., BAR = group FOO
>>>>> by URL; foreach BAR generate URL, ...).
>>>>>
>>>>> -Chris
>>>>>
>>>>> On Jun 16, 2008, at 12:48 PM, Olga Natkovich wrote:
>>>>>
>>>>>> What about naming the rest of the fields in the group? Do we want to
>>>>>> continue naming them with the names of the corresponding tables? I
>>>>>> think users find that confusing as well.
>>>>>>
>>>>>> Olga
>>>>>>
>>>>>>  -----Original Message-----
>>>>>>
>>>>>>> From: Alan Gates [mailto:gates@yahoo-inc.com]
>>>>>>> Sent: Monday, June 16, 2008 11:32 AM
>>>>>>> To: pig-dev@incubator.apache.org
>>>>>>> Subject: Re: Issues with group as an alias
>>>>>>>
>>>>>>> I would like to propose a slight modification:
>>>>>>>
>>>>>>> I think that we should continue to support 'group' as the alias name
>>>>>>> for some transition period (3 or maybe 6 months).
>>>>>>> We can remove all references to group as an alias from the
>>>>>>> documentation and print a warning when users use it.  But I don't
>>>>>>> think we should drop it immediately, as we'll break many scripts.
>>>>>>>
>>>>>>> Other than that I'm fine with the proposal.
>>>>>>>
>>>>>>> Alan.
>>>>>>>
>>>>>>> Chris Olston wrote:
>>>>>>>> No.
>>>>>>>>
>>>>>>>> The standing proposal for Option III is:
>>>>>>>>
>>>>>>>> 1. If you are (CO)Grouping on a *single* field AND in the case of
>>>>>>>> co-group all field names are the same (e.g., cogroup A by url, B by
>>>>>>>> url), then give the group key that name (e.g., "url").
>>>>>>>>
>>>>>>>> 2. Else, do *not* automatically assign any name. The user can refer to
>>>>>>>> it as $0 and/or use "AS" to give it a name manually.
>>>>>>>>
>>>>>>>> (To be clear, even in case #1, the user has the option to override the
>>>>>>>> automatically-assigned name using "AS" if s/he chooses.)
>>>>>>>>
>>>>>>>> -Chris
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:
>>>>>>>>
>>>>>>>>> I completely agree. It does start getting confusing. Especially if we
>>>>>>>>> try to deal with multi field keys.
>>>>>>>>>
>>>>>>>>> A = load 'somefile1' USING PigStorage() AS (B, C, Z);
>>>>>>>>> B = load 'somefile2' USING PigStorage() AS (A, C, Y);
>>>>>>>>> C = load 'somefile3' USING PigStorage() AS (A, B);
>>>>>>>>>
>>>>>>>>> G1 = COGROUP A by (B,C), B by (A, C);
>>>>>>>>> G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);
>>>>>>>>>
>>>>>>>>> What is the schema for G2?
>>>>>>>>>
>>>>>>>>> ben
>>>>>>>>>
>>>>>>>>> On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
>>>>>>>>>> So what is the conclusion here ?
>>>>>>>>>>
>>>>>>>>>> group key alias == the first variable's group by field ?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What happens in a case like this then :
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> A = load 'somefile1' USING PigStorage() AS (B, C);
>>>>>>>>>> B = load 'somefile2' USING PigStorage() AS (A, C);
>>>>>>>>>> C = load 'somefile3' USING PigStorage() AS (A, B);
>>>>>>>>
>>>>>>>>>> G1 = COGROUP A by B, B by A;
>>>>>>>>>> G2 = COGROUP A by C, C by A;
>>>>>>>>>> ...
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> A slightly contrived example for sure, but imo grammar has to be as
>>>>>>>>>> clearly specified as possible.
>>>>>>>>>> A reserved keyword as group alias implies we don't hit this problem
>>>>>>>>>> (group or groupkey or grpkey)... and also the fact that we are
>>>>>>>>>> backwardly compatible.
>>>>>>>>>>
>>>>>>>>>> [I never liked inferred schema prefix section in the schemas doc
>>>>>>>>>> (which is applied selectively) - makes it extremely tough to
>>>>>>>>>> generate pig scripts]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Mridul
>>>>>>>>>>
>>>>>>>>>> Alan Gates wrote:
>>>>>>>>>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the
>>>>>>>>>>> field (or set of fields) that are grouped on are given the alias
>>>>>>>>>>> 'group'.
>>>>>>>>>>> This has a couple of issues:
>>>>>>>>>>>
>>>>>>>>>>> 1)  It's confusing.  'group' is now a keyword and an alias.
>>>>>>>>>>> 2)  We don't currently allow 'group' as an alias in an AS.  It is
>>>>>>>>>>> strange to have an alias that can only be assigned by the language
>>>>>>>>>>> and never by the user.
>>>>>>>>
>>>>>>>>>>> Possible solutions:
>>>>>>>>>>>
>>>>>>>>>>> I) Status quo.  We could fix it so that group is allowed to be
>>>>>>>>>>> assigned as an alias in AS.
>>>>>>>>>>>
>>>>>>>>>>> Pros:  Backward compatibility
>>>>>>>>>>> Cons: a) will make the parser more complicated
>>>>>>>>>>>     b) see 1) above.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> II) Don't give an implicit alias to the group key(s).  If users
>>>>>>>>>>> want an alias, they can assign it using AS.
>>>>>>>>>>>
>>>>>>>>>>> Pros:  Simplicity
>>>>>>>>>>> Cons:  We do assign aliases to grouped bags.  That is, if we have C
>>>>>>>>>>> = GROUP B by $0 the resulting schema of C is (group, B).  So if we
>>>>>>>>>>> don't assign an alias to the group key, we now have a schema ($0,
>>>>>>>>>>> B).  This seems strange.  And worse yet, if users want to alias the
>>>>>>>>>>> group key(s), they'll be forced to alias all the grouped bags as
>>>>>>>>>>> well.
>>>>>>
>>>>>>>>>>> III) Carry the alias (if any) that the field had before.  So if we
>>>>>>>>>>> had a script like:
>>>>>>>>>>> A = load 'myfile' as (x, y, z);
>>>>>>>>>>> B = group A by x;
>>>>>>>>>>>
>>>>>>>>>>> Then the schema of B would be (x, A).  This is quite natural for
>>>>>>>>>>> grouping of single columns.  But it turns nasty when you group on
>>>>>>>>>>> multiple columns.  Do we then append the names together?  So if
>>>>>>>>>>> you have
>>>>>>>>
>>>>>>>>>>> B = group A by x, y;
>>>>>>>>>>>
>>>>>>>>>>> is the resulting schema (x_y, A)?  Ugh.
>>>>>>>>>>>
>>>>>>>>>>> In this case there is also the question of what to do in the case
>>>>>>>>>>> of cogroups, where the key may be named differently in different
>>>>>>>>>>> relations.
>>>>>>>>>>>
>>>>>>>>>>> A = load 'myfile' as (x, y, z);
>>>>>>>>>>> B = load 'myotherfile' as (t, u, v);
>>>>>>>>>>> C = cogroup A by x, B by t;
>>>>>>>>>>>
>>>>>>>>>>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?
>>>>>>>>>>> This could be resolved by either saying first one always wins, or
>>>>>>>>>>> allowing either.
>>>>>>>>>>>
>>>>>>>>>>> Pros:  Very natural for the users, their fields maintain names
>>>>>>>>>>> through the query.
>>>>>>>>>>> Cons:  Quickly gets burdensome in the case of multi-key groups.
>>>>>>>>>>>
>>>>>>>>>>> IV) Assign a non-keyword alias to the group key, like grp or
>>>>>>>>>>> groupkey or grpkey (or some other suitable choice).
>>>>>>>>>>> Pros:  Least disruptive change.  Users only have to go through
>>>>>>>>>>> their scripts and find places where they use the group alias and
>>>>>>>>>>> change it to grp (or whatever).
>>>>>>>>>>> Cons:  Still leaves us with a situation where we are assigning a
>>>>>>>>>>> name to a field arbitrarily, leaving users confused as to how their
>>>>>>>>>>> fields got named that.
>>>>>>>>
>>>>>>>>>>> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
>>>>>>>>>>> relation anyway.
>>>>>>>>>>> Pros:  Smaller syntax in a language is always good.
>>>>>>>>>>> Cons:  Will break a lot of scripts, and confuse a lot of users who
>>>>>>>>>>> only think in terms of GROUP and JOIN and never use COGROUP
>>>>>>>>>>> explicitly.
>>>>>>>>>>>
>>>>>>>>>>> One could also conceive of combinations of these.  For example, we
>>>>>>>>>>> always assign a name like grpkey to the group key(s), and in the
>>>>>>>>>>> single key case we also carry forward the alias that the field
>>>>>>>>>>> already had, if any.
>>>>>>>>>>>
>>>>>>>>>>> Thoughts?  Other possibilities?
>>>>>>>>>>>
>>>>>>>>>>> Alan.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Christopher Olston, Ph.D.
>>>>>>>> Sr. Research Scientist
>>>>>>>> Yahoo! Research
>
>

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research
