hadoop-pig-dev mailing list archives

From Chris Olston <ols...@yahoo-inc.com>
Subject Re: Issues with group as an alias
Date Mon, 16 Jun 2008 19:53:53 GMT
Olga,

The idea is that when there is just one field with one name, we use  
that name for the group key. In all other cases we do *not* supply an  
automatic name (the user can assign their own name using "as").

I believe this solution: (1) is very simple and unambiguous, and (2)  
makes common cases very natural (e.g., BAR = group FOO by URL; foreach
BAR generate URL, ...).
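
As a rough illustration of the rule described above (not Pig's actual implementation), the naming decision could be sketched in Python. The function name group_key_name and its input shape (one list of key field names per grouped relation, with None for an unnamed field such as $0) are assumptions made for this sketch only:

```python
from typing import Optional, Sequence


def group_key_name(
    keys_per_relation: Sequence[Sequence[Optional[str]]],
) -> Optional[str]:
    """Return the automatic group-key name, or None if none is supplied.

    Hypothetical helper: each entry lists the key fields of one relation
    in a (CO)GROUP; a field without a name is represented by None.
    """
    # A multi-field key never receives an automatic name.
    if any(len(keys) != 1 for keys in keys_per_relation):
        return None

    # Collect the single key name used by each relation.
    names = {keys[0] for keys in keys_per_relation}

    # Exactly one distinct name across all relations: reuse it.
    # (If that one "name" is None, the key simply stays unnamed.)
    if len(names) == 1:
        (name,) = names
        return name

    # Differing names (e.g. cogroup A by x, B by t): no automatic name.
    return None
```

Under this sketch, "group FOO by URL" yields the name URL, while "cogroup A by x, B by t" and any multi-field key yield no name, leaving the user to refer to the key as $0 or assign one with "as".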

-Chris

On Jun 16, 2008, at 12:48 PM, Olga Natkovich wrote:

> What about naming the rest of the fields in the group? Do we want to
> continue naming them with the names of the corresponding tables? I
> think users find that confusing as well.
>
> Olga
>
>> -----Original Message-----
>> From: Alan Gates [mailto:gates@yahoo-inc.com]
>> Sent: Monday, June 16, 2008 11:32 AM
>> To: pig-dev@incubator.apache.org
>> Subject: Re: Issues with group as an alias
>>
>> I would like to propose a slight modification:
>>
>> I think that we should continue to support 'group' as the
>> alias name for some transition period (3 or maybe 6 months).
>> We can remove all references to group as an alias from the
>> documentation and print a warning when users use it.  But I
>> don't think we should drop it immediately, as we'll break
>> many scripts.
>>
>> Other than that I'm fine with the proposal.
>>
>> Alan.
>>
>> Chris Olston wrote:
>>> No.
>>>
>>> The standing proposal for Option III is:
>>>
>>> 1. If you are (CO)Grouping on a *single* field AND in the case of
>>> co-group all field names are the same (e.g., cogroup A by url, B by
>>> url), then give the group key that name (e.g., "url").
>>> 2. Else, do *not* automatically assign any name. The user can refer to
>>> it as $0 and/or use "AS" to give it a name manually.
>>>
>>> (To be clear, even in case #1, the user has the option to override the
>>> automatically-assigned name using "AS" if s/he chooses.)
>>>
>>> -Chris
>>>
>>>
>>> On Jun 16, 2008, at 8:25 AM, Benjamin Reed wrote:
>>>
>>>> I completely agree. It does start getting confusing. Especially if we
>>>> try to deal with multi-field keys.
>>>>
>>>> A = load 'somefile1' USING PigStorage() AS (B, C, Z);
>>>> B = load 'somefile2' USING PigStorage() AS (A, C, Y);
>>>> C = load 'somefile3' USING PigStorage() AS (A, B);
>>>>
>>>> G1 = COGROUP A by (B,C), B by (A, C);
>>>> G2 = COGROUP G1 by (B_C, A.Z), C by (A, B);
>>>>
>>>> What is the schema for G2?
>>>>
>>>> ben
>>>>
>>>> On Saturday 14 June 2008 06:46:00 Mridul Muralidharan wrote:
>>>>> So what is the conclusion here?
>>>>>
>>>>> group key alias == the first variable's group-by field?
>>>>>
>>>>>
>>>>> What happens in a case like this, then:
>>>>>
>>>>> --
>>>>> A = load 'somefile1' USING PigStorage() AS (B, C);
>>>>> B = load 'somefile2' USING PigStorage() AS (A, C);
>>>>> C = load 'somefile3' USING PigStorage() AS (A, B);
>>>>>
>>>>> G1 = COGROUP A by B, B by A;
>>>>> G2 = COGROUP A by C, C by A;
>>>>> ...
>>>>> --
>>>>>
>>>>> A slightly contrived example for sure, but imo the grammar has to be
>>>>> as clearly specified as possible.
>>>>>
>>>>> A reserved keyword as group alias implies we don't hit this problem
>>>>> (group or groupkey or grpkey)... and also that we remain
>>>>> backward compatible.
>>>>>
>>>>> [I never liked inferred schema prefix section in the schemas doc
>>>>> (which is applied selectively) - makes it extremely tough to
>>>>> generate pig scripts]
>>>>>
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>> Alan Gates wrote:
>>>>>> Currently in Pig Latin, anytime a (CO)GROUP statement is used, the
>>>>>> field (or set of fields) that are grouped on are given the alias
>>>>>> 'group'.
>>>>>> This has a couple of issues:
>>>>>>
>>>>>> 1)  It's confusing.  'group' is now a keyword and an alias.
>>>>>> 2)  We don't currently allow 'group' as an alias in an AS.  It is
>>>>>> strange to have an alias that can only be assigned by the language
>>>>>> and never by the user.
>>>>>>
>>>>>> Possible solutions:
>>>>>>
>>>>>> I) Status quo.  We could fix it so that group is allowed to be
>>>>>> assigned as an alias in AS.
>>>>>>
>>>>>> Pros:  Backward compatibility
>>>>>> Cons: a) will make the parser more complicated
>>>>>>      b) see 1) above.
>>>>>>
>>>>>>
>>>>>> II) Don't give an implicit alias to the group key(s).  If users
>>>>>> want an alias, they can assign it using AS.
>>>>>>
>>>>>> Pros:  Simplicity
>>>>>> Cons:  We do assign aliases to grouped bags.  That is, if we have C
>>>>>> = GROUP B by $0 the resulting schema of C is (group, B).  So if we
>>>>>> don't assign an alias to the group key, we now have a schema ($0,
>>>>>> B).  This seems strange.  And worse yet, if users want to alias the
>>>>>> group key(s), they'll be forced to alias all the grouped bags as
>>>>>> well.
>>>>>>
>>>>>> III) Carry the alias (if any) that the field had before.  So if we
>>>>>> had a script like:
>>>>>>
>>>>>> A = load 'myfile' as (x, y, z);
>>>>>> B = group A by x;
>>>>>>
>>>>>> Then the schema of B would be (x, A).  This is quite natural for
>>>>>> grouping of single columns.  But it turns nasty when you group on
>>>>>> multiple columns.  Do we then append the names together?  So if
>>>>>> you have
>>>>>>
>>>>>> B = group A by (x, y);
>>>>>>
>>>>>> is the resulting schema (x_y, A)?  Ugh.
>>>>>>
>>>>>> In this case there is also the question of what to do in the case
>>>>>> of cogroups, where the key may be named differently in different
>>>>>> relations.
>>>>>>
>>>>>> A = load 'myfile' as (x, y, z);
>>>>>> B = load 'myotherfile' as (t, u, v);
>>>>>> C = cogroup A by x, B by t;
>>>>>>
>>>>>> Is the resulting schema (x, A, B) or (t, A, B) or are both valid?
>>>>>> This could be resolved by either saying the first one always wins,
>>>>>> or allowing either.
>>>>>>
>>>>>> Pros:  Very natural for the users, their fields maintain names
>>>>>> through the query.
>>>>>> Cons:  Quickly gets burdensome in the case of multi-key groups.
>>>>>>
>>>>>> IV) Assign a non-keyword alias to the group key, like grp or
>>>>>> groupkey or grpkey (or some other suitable choice).
>>>>>> Pros:  Least disruptive change.  Users only have to go through
>>>>>> their scripts and find places where they use the group alias and
>>>>>> change it to grp (or whatever).
>>>>>> Cons:  Still leaves us with a situation where we are assigning a
>>>>>> name to a field arbitrarily, leaving users confused as to how their
>>>>>> fields got named that.
>>>>>>
>>>>>> V) Remove GROUP as a keyword.  It is just short for COGROUP of one
>>>>>> relation anyway.
>>>>>>
>>>>>> Pros:  Smaller syntax in a language is always good.
>>>>>> Cons:  Will break a lot of scripts, and confuse a lot of users who
>>>>>> only think in terms of GROUP and JOIN and never use COGROUP
>>>>>> explicitly.
>>>>>>
>>>>>> One could also conceive of combinations of these.  For example, we
>>>>>> always assign a name like grpkey to the group key(s), and in the
>>>>>> single key case we also carry forward the alias that the field
>>>>>> already had, if any.
>>>>>>
>>>>>> Thoughts?  Other possibilities?
>>>>>>
>>>>>> Alan.
>>>>
>>>>
>>>
>>> --
>>> Christopher Olston, Ph.D.
>>> Sr. Research Scientist
>>> Yahoo! Research
>>>
>>>
>>>
>>

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research


