pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alan Gates <ga...@hortonworks.com>
Subject Re: CUBE/ROLLUP/GROUPING SETS syntax
Date Wed, 30 May 2012 20:42:41 GMT

On May 30, 2012, at 10:43 AM, Jonathan Coveney wrote:

> I was going to say the same thing Alan said w.r.t. operators: operators in
> the grammar can correspond to whatever logical and physical operators you
> want.
> 
> As far as the principle of least astonishment compared to SQL... Pig is
> already pretty astonishing. I don't know why we would bend over backwards
> to make the syntax so similar in this case when even getting to the point
> of doing a CUBE means understanding an object model that is pretty
> different from SQL.
> 
> On that note,
> 
> rel = CUBE rel BY GROUPING SETS(cols);
> 
> seems really confusing. I'd much rather overload the group operating than
> the cube operator. If I see "cube," I expect a cube. If you start doing
> rollups etc, that's not a cube, it's a group. Or it's just a rollup. Pig
> latin is simple enough that I don't think having a rollup, group_set, etc
> operator will be so confusing, because they're already going to be typing
> that stuff in the conext of
> 
> group rel by rollup(cols); and so on. I don't see how it's worth adding
> more, confusing syntax for the sake of creating parallels with a language
> we now share very little with.

Fair points.

> 
> But I won't beat it any further... if people prefer a different syntax,
> that's fine. Just excited to have the features in Pig!
+1, I can live with any of the 3 syntax choices (near SQL, original, and Jon's).

Alan.

> Jon
> 
> 2012/5/30 Alan Gates <gates@hortonworks.com>
> 
>> Some thoughts on this:
>> 
>> 1) +1 to what Dmitriy said on HAVING
>> 
>> 2) We need to be clear about separating operators in the grammar versus
>> logical plan versus physical plan.  The choices you make in the grammar are
>> totally independent of the other two.  That is, you could choose the syntax:
>> 
>> rel = GROUP rel BY CUBE (a, b, c)
>> 
>> and still have a separate POCube operator.  When the parser sees GROUP BY
>> CUBE it will generate an LOCube operator for the logical plan rather than
>> an LOGroup operator.  You can still have a separate POCube physical
>> operator.  Separate optimizations can be applied to LOGroup vs. LOCube and
>> POGroup vs. POCube.
>> 
>> 3) On syntax I can see arguments for keeping as close to SQL as possible
>> and for the syntax proposed by Prasanth.  The argument for sticking close
>> to SQL is it conforms to the law of least astonishment.  It wouldn't be
>> exactly SQL, as it would end up looking like:
>> 
>> rel = GROUP rel BY CUBE (cols)
>> rel = GROUP rel BY ROLLUP (cols)
>> rel = GROUP rel BY GROUPING SETS(cols);
>> 
>> The argument I see for sticking with Prasanth's approach is that GROUP is
>> really short for COGROUP in Pig Latin, and I don't think we're proposing
>> doing COGROUP rel BY CUBE, nor can I see a case where you'd want to do such
>> a thing.  This makes CUBE really a separate operation.  But if we go this
>> route I agree with Prasanth we should do CUBE rel BY ROLLUP and CUBE rel BY
>> GROUPING SETS.  Let's not proliferate operators.
>> 
>> Alan.
>> 
>> On May 29, 2012, at 3:55 PM, Prasanth J wrote:
>> 
>>> Thanks Jonathan for looking into it and for your suggestions.
>>> 
>>> The reason why I came with a clause rather than a separate operator was
>> to avoid adding additional operators to the grammar.
>>> So adding ROLLUP, GROUPING_SET will need separate logical operators
>> adding to the complexity. I am planning to keep everything under cube
>> operator, so only LOCube and POCube operators will be added additionally.
>> And as you and Dmitriy have mentioned the purpose of HAVING clause is the
>> same as FILTER so we do not need a separate HAVING clause.
>>> 
>>> I will give a quick recap of cube related operations and multiple syntax
>> options for achieving the same. I am also adding partial cubing and rollup
>> in this discussion.
>>> 
>>> 1) CUBE
>>> 
>>> Current syntax:
>>> alias = CUBE rel BY (a, b);
>>> 
>>> Following group-by's will be computed:
>>> (a, b)
>>> (a)
>>> (b)
>>> ()
>>> 
>>> 2) Partial CUBE
>>> 
>>> Proposed syntax:
>>> alias = CUBE rel BY a, (b, c);
>>> 
>>> Following group-by's will be computed:
>>> (a, b, c)
>>> (a, b)
>>> (a, c)
>>> (a)
>>> 
>>> 3) ROLLUP
>>> 
>>> Proposed syntax 1:
>>> alias = CUBE rel BY ROLLUP(a, b);
>>> 
>>> Proposed syntax 2:
>>> alias = CUBE rel BY (a::b);
>>> 
>>> Proposed syntax 3:
>>> alias = ROLLUP rel BY (a, b);
>>> 
>>> Following group-by's will be computed:
>>> (a, b)
>>> (a)
>>> ()
>>> 
>>> 4) Partial ROLLUP
>>> 
>>> Proposed syntax 1:
>>> alias = CUBE rel BY a, ROLLUP(b, c);
>>> 
>>> Proposed syntax 2:
>>> alias = CUBE rel BY (a, b::c);
>>> 
>>> Proposed syntax 3:
>>> alias = ROLLUP rel BY a, (b, c);
>>> 
>>> Following group-by's will be computed:
>>> (a, b, c)
>>> (a, b)
>>> (a)
>>> 
>>> 5) GROUPING SETS
>>> 
>>> Proposed syntax 1:
>>> alias = CUBE rel BY GROUPING SETS((a), (b, c), (c))
>>> 
>>> Proposed syntax 2:
>>> alias = CUBE rel BY {(a), (b, c), (c)}
>>> 
>>> Proposed syntax 3:
>>> alias = GROUPING_SET rel BY ((a), (b, c), (c))
>>> 
>>> Following group-by's will be computed:
>>> (a)
>>> (b, c)
>>> (c)
>>> 
>>> Please vote for syntax 1, 2 or 3 so that we can come to a consensus
>> before I start hacking the grammar file.
>>> 
>>> Thanks
>>> -- Prasanth
>>> 
>>> On May 29, 2012, at 4:05 PM, Jonathan Coveney wrote:
>>> 
>>>> Hey Prashanth, happy hacking.
>>>> 
>>>> My opinion:
>>>> 
>>>> CUBE:
>>>> 
>>>> alias = CUBE rel BY (a,b,c);
>>>> 
>>>> 
>>>> I like that syntax. It's unambiguous what is going on.
>>>> 
>>>> 
>>>> ROLLUP:
>>>> 
>>>> 
>>>> alias = CUBE rel BY ROLLUP(a,b,c);
>>>> 
>>>> 
>>>> I never liked that syntax in SQL. I suggest we just do what we did with
>> CUBE. IE
>>>> 
>>>> 
>>>> alias = ROLLUP rel BY (a,b,c);
>>>> 
>>>> 
>>>> GROUPING SETS:
>>>> 
>>>> 
>>>> alias = CUBE rel BY GROUPING SETS((a,b),(b),());
>>>> 
>>>> 
>>>> I don't like this. The cube vs. grouping sets is confusing to me. maybe
>>>> following the
>>>> same pattern you could do something like:
>>>> 
>>>> alias = GROUPING_SET rel BY ((a,b),(b),());
>>>> 
>>>> As far as having, is there an optimization that can be done with a
>> HAVING
>>>> clause that can't be done based on the logical plan that comes
>> afterwards?
>>>> That seems odd to me. Since you have to materialize the result anyway,
>>>> can't the having clause just be a FILTER that comes after the cube? I
>> don't
>>>> know why we need a special syntax.
>>>> 
>>>> My opinion. Forgive janky formatting, gmail + paste = pain.
>>>> Jon
>>>> 
>>>> 2012/5/27 Prasanth J <buckeye.prasanth@gmail.com>
>>>> 
>>>>> Hello everyone
>>>>> 
>>>>> I am looking for feedback from the community about the syntax for
>>>>> CUBE/ROLLUP/GROUPING SETS operations in pig.
>>>>> I am moving the discussion from JIRA to dev-list so that everyone can
>>>>> share their opinion for operator syntax. Please have a look at the
>> syntax
>>>>> proposal at the link below and let me know your opinion
>>>>> 
>>>>> 
>>>>> 
>> https://issues.apache.org/jira/browse/PIG-2167?focusedCommentId=13277644&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13277644
>>>>> 
>>>>> Thanks
>>>>> -- Prasanth
>>>>> 
>>>>> 
>>> 
>> 
>> 


Mime
View raw message