db-derby-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Db-derby Wiki] Update of "OLAPRollupLists" by BryanPendleton
Date Tue, 21 Aug 2007 21:47:19 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Db-derby Wiki" for change notification.

The following page has been changed by BryanPendleton:
http://wiki.apache.org/db-derby/OLAPRollupLists

The comment on the change is:
Rewrite to make alternate algorithms more clear

------------------------------------------------------------------------------
  WEST       null     null        3450
  }}}
  
- We could actually compute the ROLLUP query by transforming it into the equivalent UNION
query. However, one of the things that's very interesting about this query is that it can
be computed using a '''single''' sort of the data, and a '''single''' pass over that sorted
data. So the expense of computing this query is roughly equivalent to the expense of computing
the single sub-query
+ We could actually compute the ROLLUP query by transforming it into the equivalent UNION
query. However, we should seek to find an implementation in which the grouped aggregates can
be computed using a minimum number of sorts of the data, and a minimum number of passes over
the input data. This is discussed in more detail later on this page, but our goal should be
that the expense of computing this query is roughly equivalent to the expense of computing
the single sub-query
  
  {{{
  select region, state, product, sum(sales) total_sales
@@ -66, +66 @@

  group by region, state, product
  }}}
  because the other aggregation levels can be computed at the same time.
- 
- All we have to do is to extend GroupedAggregateResultSet so that instead of computing a
single result row at a time, it maintains N pending result rows, where N is the number of
levels of ROLLUP in the query (N == 4 in the query above).
- 
- So, when GroupedAggregateResultSet is processing a row of data for (region=EAST,state=MA,product=BOATS),
it should not only sum the sales for that level of aggregation, but should also at the same
time sum the sales for the aggregation levels (region=EAST,state=MA), and (region=EAST), and
().
- 
- Then, when GroupedAggregateResultSet encounters the end of the (EAST,MA,BOATS) data, and
starts to encounter (EAST,MA,CARS) data, it should finalize the result row for (EAST,MA,BOATS)
and start working on computing the result row for (EAST,MA,CARS), but it should simply continue
aggregating the result rows for (EAST,MA), (EAST), and ().
- 
- And when GroupedAggregateResultSet encounters the end of the (EAST,MA,CARS) data, and starts
to encounter (EAST,NY,BOATS) data,it should finalize the result row for (EAST,MA,CARS), and
it should also finalize the
- result row for (EAST,MA), and it should start working on computing the result row for (EAST,NY,BOATS),
but it should continue aggregating the result rows for (EAST) and for ().
  
  == Syntax ==
  
@@ -106, +97 @@

  
   * Determine the rows to be grouped, by executing the indicated query
   * For each row, augment the row with additional hidden columns which hold information about
the aggregation to be performed (SUM, MAX, COUNT, etc.)
+  * Process the input row set, using either 
+    * inline aggregation if the data is already in sorted order (or has to be sorted in order
to determine the DISTINCT values), or
+    * duplicate-eliminating sort aggregation
+  * return the resulting grouped aggregates.
+ 
+ The grouped aggregate processing occurs in GroupedAggregateResultSet.java
+ 
+ == Techniques for Implementing Grouped Aggregates ==
+ 
+ It seems that there are basically 3 ways to perform grouped aggregation:
+  1. Inline Aggregation
+  1. Duplicate-eliminating sort with sort-observer aggregation
+  1. Duplicate-eliminating hash with hash-collision aggregation
+ 
+ There also appear to be numerous ways to combine these three variants, which we refer to
as "hybrid" algorithms below.
+ 
+ === Inline Aggregation ===
+ 
+ If the records to be grouped are already in sorted order, perhaps due to a scan of an index,
or a previous merge-sort-join operation, or a duplicate-eliminating distinct sort, then the
grouping operation can be implemented by comparing each pair of records on the grouping attributes;
each time a new value of the grouping attributes is encountered, the computation of the previous
group can be finalized and emitted. 
+ 
+ This algorithm  is currently implemented by Derby for a single set of grouping attributes;
it seems possible to extend this logic for a ROLLUP set of groups, as follows:
+ 
+ === Duplicate-eliminating Sort with Sort Observer ===
+ 
+ An unsorted set of records can be grouped by feeding them into the sorter, configured for
duplicate elimination, with a sort observer which aggregates each row's values into the retained
row's columns during duplicate processing. Each such sort groups the records by a specific
set of grouping attributes. 
+ 
+ Thus we:
   * Sort the rows by the grouping columns, eliminating duplicates during the sort
   * Each time a duplicate row is eliminated, aggregate its information into the row which
the sorter retains
   * At the end of the sort, there is one remaining row for each distinct set of values of
the grouping columns, and the augmented aggregation columns in that row hold the computed
aggregate values.
  
- Thus aggregate computation is a side-effect of sorting and duplicate elimination.
+ Thus aggregate computation is a side-effect of sorting and duplicate elimination. Note that
as a side effect of using the sorter for this purpose, the result set is emitted in ORDER
BY the grouping attributes.
  
+ This algorithm is currently implemented by Derby for a single set of grouping attributes;
it seems possible to extend this logic for implementing a ROLLUP set of groups, as follows:
+  * Use N sorts to group the records by N different sets of grouping attributes. 
+  * The input rows are read only once, but then are added to each of the N sorts.
+ 
+ This should be substantially more efficient than the alternate query-rewriting algorithm
of constructing N GROUP BY statements and then UNION'ing their results together, since in
this implementation the input rows are materialized and read only once.
+ 
+ However, if multiple sorts are used to compute multiple grouping sets, whether ROLLUP or
unrelated, the overall result set may not necessarily be emitted in ORDER BY the grouping
attributes.
+ 
+ This algorithm also seems like it could be extended to implement other types of GROUPING
SETs and CUBE aggregation, since each sort can independently group the records by a different
set of unrelated grouping attributes.
+ 
+ === Hash-based Grouped Aggregation ===
+ 
+ An unsorted set of records can also be grouped by feeding them into a hash table, whose
keys are the grouping attributes, with a hash-collision observer which aggregates each row's
values into the retained row's columns during duplicate key elimination. This is fundamentally
similar to the sort-based technique, with the important difference that at the end of the
processing, the rows will NOT be emitted in sorted order.
+ 
+ I am not aware of any code which implements a hash-based technique for grouped aggregates
in Derby.
+ 
+ === Hybrid Algorithms ===
+ 
+ The inline-aggreation and sort-based techniques  can be combined into an algorithm which
computes the N ROLLUP groups using a single sort, by first sorting (and aggregating) the rows
on their finest level of grouping detail. 
+ 
+ All we have to do is to extend GroupedAggregateResultSet so that instead of computing a
single result row at a time, it maintains N pending result rows, where N is the number of
levels of ROLLUP in the query (N == 4 in the query above).
+ 
+ So, when GroupedAggregateResultSet is processing a row of data for (region=EAST,state=MA,product=BOATS),
it should not only sum the sales for that level of aggregation, but should also at the same
time sum the sales for the aggregation levels (region=EAST,state=MA), and (region=EAST), and
().
+ 
+ Then, when GroupedAggregateResultSet encounters the end of the (EAST,MA,BOATS) data, and
starts to encounter (EAST,MA,CARS) data, it should finalize the result row for (EAST,MA,BOATS)
and start working on computing the result row for (EAST,MA,CARS), but it should simply continue
aggregating the result rows for (EAST,MA), (EAST), and ().
+ 
+ And when GroupedAggregateResultSet encounters the end of the (EAST,MA,CARS) data, and starts
to encounter (EAST,NY,BOATS) data,it should finalize the result row for (EAST,MA,CARS), and
it should also finalize the
+ result row for (EAST,MA), and it should start working on computing the result row for (EAST,NY,BOATS),
but it should continue aggregating the result rows for (EAST) and for ().
+ 

Mime
View raw message