calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <jh...@apache.org>
Subject Re: Changes for Grouping Sets in Calcite
Date Wed, 09 Dec 2015 20:42:52 GMT
By “fail fast” I meant discovering rules that did not handle GROUPING_ID properly, without
explicitly writing extra tests. By putting “g” at the start, all key and aggregate columns
would be off-by-one, so it maximizes the chance of finding the problem (usually an assert
due to a datatype mismatch, or data correctness error). If you put “g” after the key columns,
only the aggregate columns will be off.

But since Hive puts it after the key columns, we would also have a greater chance of screwing
up Hive. So I don’t think “fail fast” is such a good idea.

Julian
 

> On Dec 9, 2015, at 4:13 AM, Jesus Camacho Rodriguez <jcamachorodriguez@hortonworks.com>
wrote:
> 
> Thanks for your quick reply Julian.
> 
> Although I do not feel too strongly about one choice or another, I think we could use
a single type, as it will simplify the GROUPING_ID and GROUPING functions implementation.
However, I agree with you that we cannot limit the number of grouping columns. Thus, using
BINARY as you suggested seems a logical choice.
> 
> Wrt Aggregate output, I was thinking on implementing it as [k0, k1, k2, g, agg1, agg2]
: the only reason is that Hive implements it that way. However, I would like to understand
why you think [g, k0, k1, k2, agg1, agg2] could be a better choice, I did not understand it
fully. What do you mean with "fail fast"? And about discovering rules that do not handle GROUPING_ID
properly?
> 
> --
> Jesús
> 
> 
> 
> 
> On 12/8/15, 8:16 PM, "Julian Hyde" <jhyde@apache.org> wrote:
> 
>> I am in favor of this change. The information in one GROUPING_ID column is the same
as the information in n GROUPING(i) columns, but it is easier to manage one column than several.
>> 
>> I screwed up when implementing https://issues.apache.org/jira/browse/CALCITE-370.
It would have been a lot cheaper to fix this if we had realized during the review process
that indicator columns would be difficult to handle.
>> 
>> It is important that we do not use a representation that limits the number of grouping
columns. BIGINT contains 64 bits, and therefore allows 64 grouping columns, but that is not
enough. We could use a representation that allows an unlimited number of bits, say BINARY(ceil(n
/ 8)). Or we could simply state that the client should not assume any particular data type:
for 1 grouping column the type might be TINYINT, for 10,000 grouping columns the type might
be BLOB. The GROUPING_ID function will convert the internal type to a number, and the GROUPING(i)
functions will extract the i’th bit.
>> 
>> We will need to re-organize the output row-type of Aggregate. For an aggregate with
3 group keys [k0, k1, k2], grouping sets (i.e. indicator = true), and 2 aggregate functions
[agg1, agg2], the output is currently [k0, k1, k2, i0, k1, k2, agg1, agg2] where [i0, i1,
i2] are the indicator columns. Should the new output be [k0, k1, k2, g, agg1, agg2] or should
it be [g, k0, k1, k2, agg1, agg2]? (In both cases g is the grouping_id column, which concatenates
the indicator columns into one value.) The latter is more different from the current output
format, but since the grouping_id column appears first it will “fail fast”. We will discover
more quickly that a particular rule is not handling grouping_id properly.
>> 
>> Julian
>> 
>> 
>> 
>>> On Dec 8, 2015, at 5:50 AM, Jesus Camacho Rodriguez <jcamachorodriguez@hortonworks.com>
wrote:
>>> 
>>> Hi,
>>> 
>>> I wanted to open a thread with some conversation about changes in Grouping Sets
implementation in Calcite.
>>> 
>>> Grouping sets are currently implemented in Calcite using a bit to indicate each
of the grouping columns. For instance, consider the following group by clause:
>>> 
>>> ...
>>> GROUP BY CUBE (a, b)
>>> ...
>>> 
>>> The generated Aggregate operator in Calcite will have a row schema consisting
of [a, b, GROUPING(a), GROUPING(b)], where GROUPING(x) represents whether x is participating
in the group by clause. In particular, GROUPING returns 1 for aggregated or 0 for not aggregated
in the result set.
>>> 
>>> In contrast, Hive's implementation stores a single number corresponding to the
GROUPING bit vector associated with a row (this is the result of the GROUPING_ID function
in RDBMS such as MSSQLServer, Oracle, etc). Thus, the row schema of the Aggregate operator
is [a, b, GROUPING_ID(a,b)].
>>> 
>>> This difference is creating a mismatch between Calcite and Hive. Till now, we
have solved it in the Hive side: we created our own GROUPING_ID function applied over those
columns. However, we have some issues related to predicates pushdown, constant propagation,
etc., that we need to continue solving as e.g. new rules are added to our optimizer. In short,
this is making the code on the Hive side harder and harder to maintain.
>>> 
>>> We were wondering if it would make sense to change the implementation on the
Calcite side. Which systems are using Calcite's grouping sets currently? Are you in favor/against
this change? Why?
>>> 
>>> Thanks,
>>> Jesús
>>> 
>> 
>> 


Mime
View raw message