hadoop-pig-dev mailing list archives

From Chris Olston <ols...@yahoo-inc.com>
Subject Re: Issues with group as an alias
Date Mon, 16 Jun 2008 22:03:53 GMT
Oh -- sorry I misunderstood.

That's a valid question and now is the right time to revisit it. Does  
anybody see any natural naming convention *other than* naming them  
after the input tables (Pig's current practice)? If so, let's
discuss. If not, it seems the only two choices are: (1) leave it as is,
or (2) do not assign any name, and force the user to use "AS" (this
is what Jaql does, I believe).
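
To make the two choices concrete, here is a rough sketch in Python (not Pig code) of how the output schema of a (CO)GROUP could be named. It combines the Option III rule for the group key, quoted further down, with a switch between choice (1) and choice (2) for the bags. The function names, the `name_bags` flag, and the use of `None` for "no automatic name, refer to it by position" are all made up for illustration; none of this is Pig's actual implementation.

```python
from typing import List, Optional, Tuple


def group_key_name(key_fields: List[List[Optional[str]]]) -> Optional[str]:
    """Option III: key_fields[i] lists the names input i is grouped on.

    A name of None stands for a field that has no alias (e.g. $0).
    """
    # Multi-field keys never get an automatic name.
    if any(len(fields) != 1 for fields in key_fields):
        return None
    names = {fields[0] for fields in key_fields}
    # Every input must agree on the same single name.
    if len(names) == 1:
        return next(iter(names))
    return None


def cogroup_schema(
    inputs: List[Tuple[str, List[Optional[str]]]],
    name_bags: bool = True,
) -> List[Optional[str]]:
    """Return the field names of the (CO)GROUP output, None meaning unnamed.

    inputs is a list of (relation name, group-by field names).
    name_bags=True is choice (1): bags are named after their input relation.
    name_bags=False is choice (2): bags get no name and need an explicit AS.
    """
    key = group_key_name([fields for _, fields in inputs])
    bags = [rel if name_bags else None for rel, _ in inputs]
    return [key] + bags


# BAR = group FOO by URL
print(cogroup_schema([("FOO", ["URL"])]))
# C = cogroup A by x, B by t
print(cogroup_schema([("A", ["x"]), ("B", ["t"])]))
```

Under these assumptions, the first call names the key "URL" and the bag "FOO", while the second leaves the key unnamed because the two inputs disagree on the field name.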

-Chris

On Jun 16, 2008, at 1:29 PM, Olga Natkovich wrote:

> Chris,
>
> What I meant to ask was what we do with the rest of the fields in the
> group tuples. Currently, we name those fields with the names of the
> corresponding tables. I was asking if we want to continue that. I know
> that people find it confusing to see fields named after relations.
>
> Olga
>
>> -----Original Message-----
>> From: Chris Olston [mailto:olston@yahoo-inc.com]
>> Sent: Monday, June 16, 2008 12:54 PM
>> To: pig-dev@incubator.apache.org
>> Subject: Re: Issues with group as an alias
>>
>> Olga,
>>
>> The idea is that when there is just one field with one name,
>> we use that name for the group key. In all other cases we do
>> *not* supply an automatic name (the user can assign their own
>> name using "as").
>>
>> I believe this solution: (1) is very simple and unambiguous,
>> and (2) makes common cases very natural (e.g., BAR = group FOO
>> by URL; foreach BAR generate URL, ...).
>>
>> -Chris
>>
>> On Jun 16, 2008, at 12:48 PM, Olga Natkovich wrote:
>>
>>> What about naming the rest of the fields in the group? Do we want to
>>> continue naming them with the names of the corresponding tables? I
>>> think users find that confusing as well.
>>>
>>> Olga
>>>
>>>> -----Original Message-----
>>>> From: Alan Gates [mailto:gates@yahoo-inc.com]
>>>> Sent: Monday, June 16, 2008 11:32 AM
>>>> To: pig-dev@incubator.apache.org
>>>> Subject: Re: Issues with group as an alias
>>>>
>>>> I would like to propose a slight modification:
>>>>
>>>> I think that we should continue to support 'group' as the alias name
>>>> for some transition period (3 or maybe 6 months).
>>>> We can remove all references to group as an alias from the
>>>> documentation and print a warning when users use it.  But I don't
>>>> think we should drop it immediately, as we'll break many scripts.
>>>>
>>>> Other than that I'm fine with the proposal.
>>>>
>>>> Alan.
>>>>
>>>> Chris Olston wrote:
>>>>> No.
>>>>>
>>>>> The standing proposal for Option III is:
>>>>>
>>>>> 1. If you are (CO)Grouping on a *single* field AND in the case of
>>>>> co-group all field names are the same (e.g., cogroup A by url, B by
>>>>> url), then give the group key that name (e.g., "url").
>>>>> 2. Else, do *not* automatically assign any name. The user can refer to
>>>>> it as $0 and/or use "AS" to give it a name manually.
>>>>>
>>>>> (To be clear, even in case #1, the user has the option to override the
>>>>> automatically-assigned name using "AS" if s/he chooses.)
>>>>>
>>>>> -Chris
>>>>>
>>>>>
>>>>> On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:
>>>>>
>>>>>> I completely agree. It does start getting confusing, especially if we
>>>>>> try to deal with multi-field keys.
>>>>>>
>>>>>> A = load 'somefile1' USING PigStorage() AS (B, C, Z);
>>>>>> B = load 'somefile2' USING PigStorage() AS (A, C, Y);
>>>>>> C = load 'somefile3' USING PigStorage() AS (A, B);
>>>>>>
>>>>>> G1 = COGROUP A by (B,C), B by (A, C);
>>>>>> G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);
>>>>>>
>>>>>> What is the schema for G2?
>>>>>>
>>>>>> ben
>>>>>>
>>>>>> On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
>>>>>>> So what is the conclusion here ?
>>>>>>>
>>>>>>> group key alias == the first relation's group-by field?
>>>>>>>
>>>>>>>
>>>>>>> What happens in a case like this then :
>>>>>>>
>>>>>>> --
>>>>>>> A = load 'somefile1' USING PigStorage() AS (B, C);
>>>>>>> B = load 'somefile2' USING PigStorage() AS (A, C);
>>>>>>> C = load 'somefile3' USING PigStorage() AS (A, B);
>>>>>>>
>>>>>>> G1 = COGROUP A by B, B by A;
>>>>>>> G2 = COGROUP A by C, C by A;
>>>>>>> ...
>>>>>>> --
>>>>>>>
>>>>>>> A slightly contrived example for sure, but imo the grammar has to be as
>>>>>>> clearly specified as possible.
>>>>>>>
>>>>>>> A reserved keyword as the group alias means we don't hit this problem
>>>>>>> (group or groupkey or grpkey)... and it also keeps us backward
>>>>>>> compatible.
>>>>>>>
>>>>>>> [I never liked the inferred schema prefix section in the schemas doc
>>>>>>> (which is applied selectively) - it makes it extremely tough to
>>>>>>> generate pig scripts]
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>> Alan Gates wrote:
>>>>>>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the
>>>>>>>> field (or set of fields) that are grouped on are given the alias
>>>>>>>> 'group'.
>>>>>>>> This has a couple of issues:
>>>>>>>>
>>>>>>>> 1)  It's confusing.  'group' is now a keyword and an alias.
>>>>>>>> 2)  We don't currently allow 'group' as an alias in an AS.  It is
>>>>>>>> strange to have an alias that can only be assigned by the language
>>>>>>>> and never by the user.
>>>>>>>>
>>>>>>>> Possible solutions:
>>>>>>>>
>>>>>>>> I) Status quo.  We could fix it so that group is allowed to be
>>>>>>>> assigned as an alias in AS.
>>>>>>>>
>>>>>>>> Pros:  Backward compatibility
>>>>>>>> Cons: a) will make the parser more complicated
>>>>>>>>      b) see 1) above.
>>>>>>>>
>>>>>>>>
>>>>>>>> II) Don't give an implicit alias to the group key(s).  If users
>>>>>>>> want an alias, they can assign it using AS.
>>>>>>>>
>>>>>>>> Pros:  Simplicity
>>>>>>>> Cons:  We do assign aliases to grouped bags.  That is, if we have C
>>>>>>>> = GROUP B by $0 the resulting schema of C is (group, B).  So if we
>>>>>>>> don't assign an alias to the group key, we now have a schema ($0,
>>>>>>>> B).  This seems strange.  And worse yet, if users want to alias the
>>>>>>>> group key(s), they'll be forced to alias all the grouped bags as
>>>>>>>> well.
>>>>>>>>
>>>>>>>> III) Carry the alias (if any) that the field had before.  So if we
>>>>>>>> had a script like:
>>>>>>>>
>>>>>>>> A = load 'myfile' as (x, y, z);
>>>>>>>> B = group A by x;
>>>>>>>>
>>>>>>>> Then the schema of B would be (x, A).  This is quite natural for
>>>>>>>> grouping of single columns.  But it turns nasty when you group on
>>>>>>>> multiple columns.  Do we then append the names together?  So if
>>>>>>>> you have
>>>>>>>>
>>>>>>>> B = group A by x, y;
>>>>>>>>
>>>>>>>> is the resulting schema (x_y, A)?  Ugh.
>>>>>>>>
>>>>>>>> In this case there is also the question of what to do in the case
>>>>>>>> of cogroups, where the key may be named differently in different
>>>>>>>> relations.
>>>>>>>>
>>>>>>>> A = load 'myfile' as (x, y, z);
>>>>>>>> B = load 'myotherfile' as (t, u, v);
>>>>>>>> C = cogroup A by x, B by t;
>>>>>>>>
>>>>>>>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?
>>>>>>>> This could be resolved by either saying the first one always wins, or
>>>>>>>> allowing either.
>>>>>>>>
>>>>>>>> Pros:  Very natural for the users, their fields maintain names
>>>>>>>> through the query.
>>>>>>>> Cons:  Quickly gets burdensome in the case of multi-key groups.
>>>>>>>>
>>>>>>>> IV) Assign a non-keyword alias to the group key, like grp or
>>>>>>>> groupkey or grpkey (or some other suitable choice).
>>>>>>>> Pros:  Least disruptive change.  Users only have to go through
>>>>>>>> their scripts and find places where they use the group alias and
>>>>>>>> change it to grp (or whatever).
>>>>>>>> Cons:  Still leaves us with a situation where we are assigning a
>>>>>>>> name to a field arbitrarily, leaving users confused as to how their
>>>>>>>> fields got named that.
>>>>>>>>
>>>>>>>> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
>>>>>>>> relation anyway.
>>>>>>>>
>>>>>>>> Pros:  Smaller syntax in a language is always good.
>>>>>>>> Cons:  Will break a lot of scripts, and confuse a lot of users who
>>>>>>>> only think in terms of GROUP and JOIN and never use COGROUP
>>>>>>>> explicitly.
>>>>>>>>
>>>>>>>> One could also conceive of combinations of these.  For example, we
>>>>>>>> always assign a name like grpkey to the group key(s), and in the
>>>>>>>> single key case we also carry forward the alias that the field
>>>>>>>> already had, if any.
>>>>>>>>
>>>>>>>> Thoughts?  Other possibilities?
>>>>>>>>
>>>>>>>> Alan.
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Christopher Olston, Ph.D.
>>>>> Sr. Research Scientist
>>>>> Yahoo! Research
>>>>>
>>>>>
>>>>>
>>>>
>>
>> --
>> Christopher Olston, Ph.D.
>> Sr. Research Scientist
>> Yahoo! Research
>>
>>
>>

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research


