Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Dbderby Wiki" for change notification.
The following page has been changed by BryanPendleton:
http://wiki.apache.org/dbderby/OLAPRollupLists
The comment on the change is:
Rewrite to make alternate algorithms more clear

WEST null null 3450
}}}
 We could actually compute the ROLLUP query by transforming it into the equivalent UNION query. However, one of the things that's very interesting about this query is that it can be computed using a '''single''' sort of the data, and a '''single''' pass over that sorted data. So the expense of computing this query is roughly equivalent to the expense of computing the single subquery
+ We could actually compute the ROLLUP query by transforming it into the equivalent UNION query. However, we should seek to find an implementation in which the grouped aggregates can be computed using a minimum number of sorts of the data, and a minimum number of passes over the input data. This is discussed in more detail later on this page, but our goal should be that the expense of computing this query is roughly equivalent to the expense of computing the single subquery
{{{
select region, state, product, sum(sales) total_sales
@@ 66, +66 @@
group by region, state, product
}}}
because the other aggregation levels can be computed at the same time.

 All we have to do is to extend GroupedAggregateResultSet so that instead of computing a single result row at a time, it maintains N pending result rows, where N is the number of levels of ROLLUP in the query (N == 4 in the query above).

 So, when GroupedAggregateResultSet is processing a row of data for (region=EAST,state=MA,product=BOATS), it should not only sum the sales for that level of aggregation, but should also at the same time sum the sales for the aggregation levels (region=EAST,state=MA), and (region=EAST), and ().

 Then, when GroupedAggregateResultSet encounters the end of the (EAST,MA,BOATS) data, and starts to encounter (EAST,MA,CARS) data, it should finalize the result row for (EAST,MA,BOATS) and start working on computing the result row for (EAST,MA,CARS), but it should simply continue aggregating the result rows for (EAST,MA), (EAST), and ().

 And when GroupedAggregateResultSet encounters the end of the (EAST,MA,CARS) data, and starts to encounter (EAST,NY,BOATS) data,it should finalize the result row for (EAST,MA,CARS), and it should also finalize the
 result row for (EAST,MA), and it should start working on computing the result row for (EAST,NY,BOATS), but it should continue aggregating the result rows for (EAST) and for ().
== Syntax ==
@@ 106, +97 @@
* Determine the rows to be grouped, by executing the indicated query
* For each row, augment the row with additional hidden columns which hold information about the aggregation to be performed (SUM, MAX, COUNT, etc.)
+ * Process the input row set, using either
+ * inline aggregation if the data is already in sorted order (or has to be sorted in order to determine the DISTINCT values), or
+ * duplicateeliminating sort aggregation
+ * return the resulting grouped aggregates.
+
+ The grouped aggregate processing occurs in GroupedAggregateResultSet.java
+
+ == Techniques for Implementing Grouped Aggregates ==
+
+ It seems that there are basically 3 ways to perform grouped aggregation:
+ 1. Inline Aggregation
+ 1. Duplicateeliminating sort with sortobserver aggregation
+ 1. Duplicateeliminating hash with hashcollision aggregation
+
+ There also appear to be numerous ways to combine these three variants, which we refer to as "hybrid" algorithms below.
+
+ === Inline Aggregation ===
+
+ If the records to be grouped are already in sorted order, perhaps due to a scan of an index, or a previous mergesortjoin operation, or a duplicateeliminating distinct sort, then the grouping operation can be implemented by comparing each pair of records on the grouping attributes; each time a new value of the grouping attributes is encountered, the computation of the previous group can be finalized and emitted.
+
+ This algorithm is currently implemented by Derby for a single set of grouping attributes; it seems possible to extend this logic for a ROLLUP set of groups, as follows:
+
+ === Duplicateeliminating Sort with Sort Observer ===
+
+ An unsorted set of records can be grouped by feeding them into the sorter, configured for duplicate elimination, with a sort observer which aggregates each row's values into the retained row's columns during duplicate processing. Each such sort groups the records by a specific set of grouping attributes.
+
+ Thus we:
* Sort the rows by the grouping columns, eliminating duplicates during the sort
* Each time a duplicate row is eliminated, aggregate its information into the row which the sorter retains
* At the end of the sort, there is one remaining row for each distinct set of values of the grouping columns, and the augmented aggregation columns in that row hold the computed aggregate values.
 Thus aggregate computation is a sideeffect of sorting and duplicate elimination.
+ Thus aggregate computation is a sideeffect of sorting and duplicate elimination. Note that as a side effect of using the sorter for this purpose, the result set is emitted in ORDER BY the grouping attributes.
+ This algorithm is currently implemented by Derby for a single set of grouping attributes; it seems possible to extend this logic for implementing a ROLLUP set of groups, as follows:
+ * Use N sorts to group the records by N different sets of grouping attributes.
+ * The input rows are read only once, but then are added to each of the N sorts.
+
+ This should be substantially more efficient than the alternate queryrewriting algorithm of constructing N GROUP BY statements and then UNION'ing their results together, since in this implementation the input rows are materialized and read only once.
+
+ However, if multiple sorts are used to compute multiple grouping sets, whether ROLLUP or unrelated, the overall result set may not necessarily be emitted in ORDER BY the grouping attributes.
+
+ This algorithm also seems like it could be extended to implement other types of GROUPING SETs and CUBE aggregation, since each sort can independently group the records by a different set of unrelated grouping attributes.
+
+ === Hashbased Grouped Aggregation ===
+
+ An unsorted set of records can also be grouped by feeding them into a hash table, whose keys are the grouping attributes, with a hashcollision observer which aggregates each row's values into the retained row's columns during duplicate key elimination. This is fundamentally similar to the sortbased technique, with the important difference that at the end of the processing, the rows will NOT be emitted in sorted order.
+
+ I am not aware of any code which implements a hashbased technique for grouped aggregates in Derby.
+
+ === Hybrid Algorithms ===
+
+ The inlineaggreation and sortbased techniques can be combined into an algorithm which computes the N ROLLUP groups using a single sort, by first sorting (and aggregating) the rows on their finest level of grouping detail.
+
+ All we have to do is to extend GroupedAggregateResultSet so that instead of computing a single result row at a time, it maintains N pending result rows, where N is the number of levels of ROLLUP in the query (N == 4 in the query above).
+
+ So, when GroupedAggregateResultSet is processing a row of data for (region=EAST,state=MA,product=BOATS), it should not only sum the sales for that level of aggregation, but should also at the same time sum the sales for the aggregation levels (region=EAST,state=MA), and (region=EAST), and ().
+
+ Then, when GroupedAggregateResultSet encounters the end of the (EAST,MA,BOATS) data, and starts to encounter (EAST,MA,CARS) data, it should finalize the result row for (EAST,MA,BOATS) and start working on computing the result row for (EAST,MA,CARS), but it should simply continue aggregating the result rows for (EAST,MA), (EAST), and ().
+
+ And when GroupedAggregateResultSet encounters the end of the (EAST,MA,CARS) data, and starts to encounter (EAST,NY,BOATS) data,it should finalize the result row for (EAST,MA,CARS), and it should also finalize the
+ result row for (EAST,MA), and it should start working on computing the result row for (EAST,NY,BOATS), but it should continue aggregating the result rows for (EAST) and for ().
+