Mailing-List: contact derby-commits-help@db.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Derby Development" <derby-dev@db.apache.org>
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: Apache Wiki <wikidiffs@apache.org>
To: derby-commits@db.apache.org
Date: Sat, 04 Aug 2007 16:30:16 -0000
Message-ID: <20070804163016.24741.47510@eos.apache.org>
Subject: [Db-derby Wiki] Update of "OLAPRollupLists" by BryanPendleton

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Db-derby Wiki" for change notification.

The following page has been changed by BryanPendleton:
http://wiki.apache.org/db-derby/OLAPRollupLists

The comment on the change is:
Start new page for ROLLUP

New page:
As part of the ["OLAPOperations"] effort, we'd like to add support for ROLLUP multi-level grouping.

== Background ==

Consider a table of sales transaction data, which tracks information about particular sales:
{{{
create table sales_history (region varchar(10), state char(2), product varchar(10), date_of_sale timestamp, sales int);
}}}
Such a table has one row for each sales transaction, recording the region and state where the sale occurred, the product that was sold, the date on which it was sold, and the amount of the sale.

Now suppose that we'd like to analyze our sales data, to study the amount of sales that is occurring for different products, in different states and regions. Using the ROLLUP feature of SQL 2003, we could issue the query:

{{{
select region, state, product, sum(sales) total_sales
from sales_history 
group by rollup (region, state, product)
}}}

Semantically, the above query is equivalent to

{{{
select region, state, product, sum(sales) total_sales
from sales_history 
group by region, state, product
union
select region, state, null, sum(sales) total_sales
from sales_history 
group by region, state
union
select region, null, null, sum(sales) total_sales
from sales_history 
group by region
union
select null, null, null, sum(sales) total_sales
from sales_history 
}}}

The query might produce results that looked something like:
{{{
REGION    STATE    PRODUCT    TOTAL_SALES
------    -----    -------    -----------
null       null     null        6200
EAST       MA       BOATS       100
EAST       MA       CARS        1500
EAST       MA       null        1600
EAST       NY       BOATS       150
EAST       NY       CARS        1000
EAST       NY       null        1150
EAST       null     null        2750
WEST       CA       BOATS       750
WEST       CA       CARS        500
WEST       CA       null        1250
WEST       AZ       BOATS       2000
WEST       AZ       CARS        200
WEST       AZ       null        2200
WEST       null     null        3450
}}}

We could actually compute the ROLLUP query by transforming it into the equivalent UNION query. However, one of the things that's very interesting about this query is that it can be computed using a '''single''' sort of the data, and a '''single''' pass over that sorted data. So the expense of computing this query is roughly equivalent to the expense of computing the single sub-query

{{{
select region, state, product, sum(sales) total_sales
from sales_history 
group by region, state, product
}}}
because the other aggregation levels can be computed at the same time.

All we have to do is to extend GroupedAggregateResultSet so that instead of computing a single result row at a time, it maintains N pending result rows, where N is the number of levels of ROLLUP in the query (N == 4 in the query above).

So, when GroupedAggregateResultSet is processing a row of data for (region=EAST,state=MA,product=BOATS), it should not only sum the sales for that level of aggregation, but should also at the same time sum the sales for the aggregation levels (region=EAST,state=MA), and (region=EAST), and ().

Then, when GroupedAggregateResultSet encounters the end of the (EAST,MA,BOATS) data, and starts to encounter (EAST,MA,CARS) data, it should finalize the result row for (EAST,MA,BOATS) and start working on computing the result row for (EAST,MA,CARS), but it should simply continue aggregating the result rows for (EAST,MA), (EAST), and ().

And when GroupedAggregateResultSet encounters the end of the (EAST,MA,CARS) data, and starts to encounter (EAST,NY,BOATS) data,it should finalize the result row for (EAST,MA,CARS), and it should also finalize the
result row for (EAST,MA), and it should start working on computing the result row for (EAST,NY,BOATS), but it should continue aggregating the result rows for (EAST) and for ().

== Syntax ==

In the SQL 2003 standard, the ROLLUP keyword impacts the following portions of the SQL grammar:

{{{
<grouping element> ::=
    <ordinary grouping set>
  | <rollup list>
  | <cube list>
  | <grouping sets specification>
  | <empty grouping set>

<rollup list> ::=
    ROLLUP <left paren> <ordinary grouping set list> <right paren>

<grouping set> ::=
    <ordinary grouping set>
  | <rollup list>
  | <cube list>
  | <grouping sets specification>
  | <empty grouping set>
}}}
The above grammar is just a tiny subset of the overall GROUP BY grammar, namely those parts that involve ROLLUP.