pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Prasanth J <buckeye.prasa...@gmail.com>
Subject Re: CUBE/ROLLUP/GROUPING SETS syntax
Date Thu, 21 Jun 2012 20:43:19 GMT
Yeah you are right.  

Thanks
-- Prasanth

On Jun 21, 2012, at 4:41 PM, Jonathan Coveney wrote:

> Just to make sure I understand this correctly, is
> 
> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
> 
> equivalent to:
> 
> out1 = CUBE rel BY (a,b,c);
> out2 = ROLLUP rel BY (c,d);
> out3 = CUBY rel BY (e,f);
> 
> out = CROSS out1, out2, out3;
> 
> ?
> 
> 2012/6/21 Prasanth J <buckeye.prasanth@gmail.com>
> 
>> Hello all
>> 
>> I initially implemented ROLLUP as a separate operation with the following
>> syntax
>> 
>> a = ROLLUP inp BY (x,y);
>> 
>> which does the same thing as CUBE (inserting foreach + group-by in logical
>> plan) except that it uses RollupDimensions UDF. But the issue with this
>> approach is that we cannot mix CUBE and ROLLUP operations together in the
>> same syntax which is a typical case. SQL/Oracle supports using CUBE and
>> ROLLUP together like
>> 
>> GROUP BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>> 
>> so I modified the pig grammar to support the similar usage. So now we can
>> use a syntax similar to SQL
>> 
>> out = CUBE rel BY CUBE(a,b,c), ROLLUP(c,d), CUBE(e,f);
>> 
>> In this approach, the logical plan should introduce cartesian product
>> between bags generated by CUBE(a,b,c), ROLLUP(c,d) and CUBE(e,f) for
>> generating the final output. But I read from the documentation (
>> http://pig.apache.org/docs/r0.10.0/basic.html#cross) that CROSS operator
>> is an expensive operator and advices to use it sparingly.
>> 
>> Is there any other way to achieve the cartesian product in a less
>> expensive way? Also, does anyone have thoughts about this new syntax?
>> 
>> Thanks
>> -- Prasanth
>> 
>> On May 30, 2012, at 8:10 PM, Jonathan Coveney wrote:
>> 
>>> As far as the underlying implementation, if they all use the same
>>> optimizations that you use in cube, then it can be LOCube. If they have
>>> their own optimizations etc (or could), it may be worth them having their
>>> own Logical operators (which might just be LOCube with flags for the time
>>> being) that allows us more flexibilty. But I suppose that's between you,
>>> eclipse, and your GSOC mentor.
>>> 
>>> 2012/5/30 Prasanth J <buckeye.prasanth@gmail.com>
>>> 
>>>> Thanks Alan and Jon for expressing your views.
>>>> 
>>>> I agree with Jon's point, if the syntax contains CUBE then user expects
>> it
>>>> to perform CUBE operation. So Jon's syntax seems more meaningful and
>> concise
>>>> 
>>>> rel = CUBE rel BY (dims);
>>>> rel = ROLLUP rel BY (dims);
>>>> rel = GROUPING_SET rel BY (dims);
>>>> 
>>>> 2 reasons why I do not prefer using SQL syntax is
>>>> 1) I do not want to break into existing Group operator implementation :)
>>>> 2) The syntax gets longer in case of partial hierarchical cubing/rollups
>>>> For ex:
>>>> 
>>>> rel = GROUP rel BY dim0, ROLLUP(dim1, dim2, dim3),
>> ROLLUP(dim4,dim5,dim6),
>>>> ROLLUP(dim7,dim8,dim9);
>>>> 
>>>> whereas same thing can be expressed like
>>>> 
>>>> rel = ROLLUP rel BY dim0,
>>>> (dim1,dim2,dim3),(dim4,dim5,dim6),(dim7,dim8,dim9);
>>>> 
>>>> Thanks Alan for pointing out the way for independently managing the
>>>> operators in parser and logical/physical plan. So for all these
>> operators
>>>> (CUBE, ROLLUP, GROUPING_SET) I can just generate LOCube and use flags to
>>>> differentiate between these three operations.
>>>> 
>>>> But, yes we are proliferating operators in this case.
>>>> 
>>>> Thanks
>>>> -- Prasanth
>>>> 
>>>> On May 30, 2012, at 4:42 PM, Alan Gates wrote:
>>>> 
>>>>> 
>>>>> On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:
>>>>> 
>>>>>> I was going to say the same thing Alan said w.r.t. operators:
>> operators
>>>> in
>>>>>> the grammar can correspond to whatever logical and physical operators
>>>> you
>>>>>> want.
>>>>>> 
>>>>>> As far as the principle of least astonishment compared to SQL...
Pig
>> is
>>>>>> already pretty astonishing. I don't know why we would bend over
>>>> backwards
>>>>>> to make the syntax so similar in this case when even getting to the
>>>> point
>>>>>> of doing a CUBE means understanding an object model that is pretty
>>>>>> different from SQL.
>>>>>> 
>>>>>> On that note,
>>>>>> 
>>>>>> rel = CUBE rel BY GROUPING SETS(cols);
>>>>>> 
>>>>>> seems really confusing. I'd much rather overload the group operating
>>>> than
>>>>>> the cube operator. If I see "cube," I expect a cube. If you start
>> doing
>>>>>> rollups etc, that's not a cube, it's a group. Or it's just a rollup.
>> Pig
>>>>>> latin is simple enough that I don't think having a rollup, group_set,
>>>> etc
>>>>>> operator will be so confusing, because they're already going to be
>>>> typing
>>>>>> that stuff in the conext of
>>>>>> 
>>>>>> group rel by rollup(cols); and so on. I don't see how it's worth
>> adding
>>>>>> more, confusing syntax for the sake of creating parallels with a
>>>> language
>>>>>> we now share very little with.
>>>>> 
>>>>> Fair points.
>>>>> 
>>>>>> 
>>>>>> But I won't beat it any further... if people prefer a different
>> syntax,
>>>>>> that's fine. Just excited to have the features in Pig!
>>>>> +1, I can live with any of the 3 syntax choices (near SQL, original,
>> and
>>>> Jon's).
>>>>> 
>>>>> Alan.
>>>>> 
>>>>>> Jon
>>>>>> 
>>>>>> 2012/5/30 Alan Gates <gates@hortonworks.com>
>>>>>> 
>>>>>>> Some thoughts on this:
>>>>>>> 
>>>>>>> 1) +1 to what Dmitriy said on HAVING
>>>>>>> 
>>>>>>> 2) We need to be clear about separating operators in the grammar
>> versus
>>>>>>> logical plan versus physical plan.  The choices you make in the
>>>> grammar are
>>>>>>> totally independent of the other two.  That is, you could choose
the
>>>> syntax:
>>>>>>> 
>>>>>>> rel = GROUP rel BY CUBE (a, b, c)
>>>>>>> 
>>>>>>> and still have a separate POCube operator.  When the parser sees
>> GROUP
>>>> BY
>>>>>>> CUBE it will generate an LOCube operator for the logical plan
rather
>>>> than
>>>>>>> an LOGroup operator.  You can still have a separate POCube physical
>>>>>>> operator.  Separate optimizations can be applied to LOGroup vs.
>> LOCube
>>>> and
>>>>>>> POGroup vs. POCube.
>>>>>>> 
>>>>>>> 3) On syntax I can see arguments for keeping as close to SQL
as
>>>> possible
>>>>>>> and for the syntax proposed by Prasanth.  The argument for sticking
>>>> close
>>>>>>> to SQL is it conforms to the law of least astonishment.  It wouldn't
>> be
>>>>>>> exactly SQL, as it would end up looking like:
>>>>>>> 
>>>>>>> rel = GROUP rel BY CUBE (cols)
>>>>>>> rel = GROUP rel BY ROLLUP (cols)
>>>>>>> rel = GROUP rel BY GROUPING SETS(cols);
>>>>>>> 
>>>>>>> The argument I see for sticking with Prasanth's approach is that
>> GROUP
>>>> is
>>>>>>> really short for COGROUP in Pig Latin, and I don't think we're
>>>> proposing
>>>>>>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want
to
>> do
>>>> such
>>>>>>> a thing.  This makes CUBE really a separate operation.  But if
we go
>>>> this
>>>>>>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and
CUBE
>>>> rel BY
>>>>>>> GROUPING SETS.  Let's not proliferate operators.
>>>>>>> 
>>>>>>> Alan.
>>>>>>> 
>>>>>>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>>>>>>> 
>>>>>>>> Thanks Jonathan for looking into it and for your suggestions.
>>>>>>>> 
>>>>>>>> The reason why I came with a clause rather than a separate
operator
>>>> was
>>>>>>> to avoid adding additional operators to the grammar.
>>>>>>>> So adding ROLLUP, GROUPING_SET will need separate logical
operators
>>>>>>> adding to the complexity. I am planning to keep everything under
cube
>>>>>>> operator, so only LOCube and POCube operators will be added
>>>> additionally.
>>>>>>> And as you and Dmitriy have mentioned the purpose of HAVING clause
is
>>>> the
>>>>>>> same as FILTER so we do not need a separate HAVING clause.
>>>>>>>> 
>>>>>>>> I will give a quick recap of cube related operations and
multiple
>>>> syntax
>>>>>>> options for achieving the same. I am also adding partial cubing
and
>>>> rollup
>>>>>>> in this discussion.
>>>>>>>> 
>>>>>>>> 1) CUBE
>>>>>>>> 
>>>>>>>> Current syntax:
>>>>>>>> alias = CUBE rel BY (a, b);
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b)
>>>>>>>> (a)
>>>>>>>> (b)
>>>>>>>> ()
>>>>>>>> 
>>>>>>>> 2) Partial CUBE
>>>>>>>> 
>>>>>>>> Proposed syntax:
>>>>>>>> alias = CUBE rel BY a, (b, c);
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b, c)
>>>>>>>> (a, b)
>>>>>>>> (a, c)
>>>>>>>> (a)
>>>>>>>> 
>>>>>>>> 3) ROLLUP
>>>>>>>> 
>>>>>>>> Proposed syntax 1:
>>>>>>>> alias = CUBE rel BY ROLLUP(a, b);
>>>>>>>> 
>>>>>>>> Proposed syntax 2:
>>>>>>>> alias = CUBE rel BY (a::b);
>>>>>>>> 
>>>>>>>> Proposed syntax 3:
>>>>>>>> alias = ROLLUP rel BY (a, b);
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b)
>>>>>>>> (a)
>>>>>>>> ()
>>>>>>>> 
>>>>>>>> 4) Partial ROLLUP
>>>>>>>> 
>>>>>>>> Proposed syntax 1:
>>>>>>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>>>>>>> 
>>>>>>>> Proposed syntax 2:
>>>>>>>> alias = CUBE rel BY (a, b::c);
>>>>>>>> 
>>>>>>>> Proposed syntax 3:
>>>>>>>> alias = ROLLUP rel BY a, (b, c);
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a, b, c)
>>>>>>>> (a, b)
>>>>>>>> (a)
>>>>>>>> 
>>>>>>>> 5) GROUPING SETS
>>>>>>>> 
>>>>>>>> Proposed syntax 1:
>>>>>>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>>>>>>> 
>>>>>>>> Proposed syntax 2:
>>>>>>>> alias = CUBE rel BY {(a), (b, c), (c)}
>>>>>>>> 
>>>>>>>> Proposed syntax 3:
>>>>>>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>>>>>>> 
>>>>>>>> Following group-by's will be computed:
>>>>>>>> (a)
>>>>>>>> (b, c)
>>>>>>>> (c)
>>>>>>>> 
>>>>>>>> Please vote for syntax 1, 2 or 3 so that we can come to a
consensus
>>>>>>> before I start hacking the grammar file.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> -- Prasanth
>>>>>>>> 
>>>>>>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>>>>>>> 
>>>>>>>>> Hey Prashanth, happy hacking.
>>>>>>>>> 
>>>>>>>>> My opinion:
>>>>>>>>> 
>>>>>>>>> CUBE:
>>>>>>>>> 
>>>>>>>>> alias = CUBE rel BY (a,b,c);
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I like that syntax. It's unambiguous what is going on.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> ROLLUP:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I never liked that syntax in SQL. I suggest we just do
what we did
>>>> with
>>>>>>> CUBE. IE
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> alias = ROLLUP rel BY (a,b,c);
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> GROUPING SETS:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I don't like this. The cube vs. grouping sets is confusing
to me.
>>>> maybe
>>>>>>>>> following the
>>>>>>>>> same pattern you could do something like:
>>>>>>>>> 
>>>>>>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>>>>>>> 
>>>>>>>>> As far as having, is there an optimization that can be
done with a
>>>>>>> HAVING
>>>>>>>>> clause that can't be done based on the logical plan that
comes
>>>>>>> afterwards?
>>>>>>>>> That seems odd to me. Since you have to materialize the
result
>>>> anyway,
>>>>>>>>> can't the having clause just be a FILTER that comes after
the
>> cube? I
>>>>>>> don't
>>>>>>>>> know why we need a special syntax.
>>>>>>>>> 
>>>>>>>>> My opinion. Forgive janky formatting, gmail + paste =
pain.
>>>>>>>>> Jon
>>>>>>>>> 
>>>>>>>>> 2012/5/27 Prasanth J <buckeye.prasanth@gmail.com>
>>>>>>>>> 
>>>>>>>>>> Hello everyone
>>>>>>>>>> 
>>>>>>>>>> I am looking for feedback from the community about
the syntax for
>>>>>>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>>>>>>> I am moving the discussion from JIRA to dev-list
so that everyone
>>>> can
>>>>>>>>>> share their opinion for operator syntax. Please have
a look at the
>>>>>>> syntax
>>>>>>>>>> proposal at the link below and let me know your opinion
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>> 
>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> -- Prasanth
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>> 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message