I'm not familiar with the Derby query engine, so I can't comment on the specifics.  I know that when I implemented the equivalent enhancement in the MetaMatrix query engine that it was fairly straightforward to add expression capability while grouping rather than rewrite as a subquery.  The main thing was to handle expression matching in the HAVING selection and SELECT projection above the grouping to detect columns that were outputs from the GROUP BY and not expressions to be evaluated. 
 
We also have done some interesting optimizations with partial aggregate pushdowns, particularly in multi-database queries that rely pretty heavily on having grouping expressions, so this may open up some new optimization options for you as well.  Partial aggregate pushdown can actually push an aggregate over a join through the join and do a partial pre-aggregation below the join.  In Derby, it's hard to say whether this is a big deal but for us it's crucial in making aggregates over multi-database joins efficient. 
 
An example is something like:
 
SELECT SUM(o.O_TOTALPRICE ), n.N_NAME 
FROM db_a.ORDERS o, db_b.CUSTOMER c, db_b.NATION n 
WHERE (c.C_CUSTKEY=o.O_CUSTKEY) AND  (n.N_NATIONKEY=c.C_NATIONKEY) 
GROUP BY n.N_NAME 
 
which without partial aggregation, would cause source queries like:
 
db_a: SELECT O_CUSTKEY, O_TOTALPRICE FROM ORDERS
db_b: SELECT C_CUSTKEY, C_NATIONKEY, N_NAME FROM CUSTOMER, NATION
 
and the join and aggregation must occur inside the engine (which requires reading all of these 3 tables from the source).  Bad stuff.
 
With partial aggregation, we're able to push the aggregate partially through the join and aggregate on customer in db_a, then re-aggregate on nation in the engine:
 
db_a: SELECT O_CUSTKEY, SUM(O_TOTALPRICE) FROM ORDERS GROUP BY O_CUSTKEY
db_b: same
 
This significantly reduces the data retrieved from the ORDERS table (the biggest), by whatever the fanout is from customer to order.  This is still not necessarily that compelling an example, but there are many other common queries where the difference is even more significant.  Turns out that the tricky thing is that sometimes partial aggregation is *worse* if the additional aggregation is effectively not reducing the fanout at all (or much).  In that case, you still retrieve just as much data but you do aggregation in both the db and the engine, so it ends up being slower than the original.  This is pretty easily detected with some minimal costing information.  Of course, there are also limitations on when this can be used but they still cover a good portion of the real set of queries.
 
Alex Miller
Chief Architect
MetaMatrix


From: Manish Khettry [mailto:manish.khettry@gmail.com]
Sent: Saturday, May 20, 2006 6:17 PM
To: derby-dev@db.apache.org
Subject: Re: [jira] Commented: (DERBY-883) Enhance GROUP BY clause to support expressions instead of just column references.

Thanks Alex and Satheesh, that does clarify things quite a bit.

I spent some time looking at the GroupedAggregateResultSet and I have the feeling that atleast on the execution side of things not much will have to change. A grouped aggregate RS already sits on top of a ProjectRestrictRS and if we can have the prRS evaluate the expressions we are grouping on, the existing logic should work without any (or perhaps minimal) changes. It is on the compilation side of things that this is, err rather hairy :) Someone should correct me if I'm missing something here or simplifying things.

The other approach is to rewrite the query to use a select subquery like Satheesh suggested. Perhaps this is a better way to go. Any thoughts on which approach is better?


On 5/18/06, Alex Miller <amiller@metamatrix.com> wrote:
No, we don't go quite that far.  Because most dbs support group by expressions, we haven't needed to implement this. 


From: Satheesh Bandaram [mailto:bandaram@gmail.com]
Sent: Thursday, May 18, 2006 2:16 PM

To: derby-dev@db.apache.org
Subject: Re: [jira] Commented: (DERBY-883) Enhance GROUP BY clause to support expressions instead of just column references.

Thanks Alex, for your message and the blogs... It was an interesting reading. Wonder if MetaMatrix rewrites queries with GROUP BY expressions for Derby into SELECT subqueries... something like:

select year(hiredate), month(hiredate) as month, count(*)  from employees
group by year(hiredate), month(hiredate);

INTO something like:

select years, months, count(*) from (select year(hiredate), month(hiredate) from employees) emp (years,months) group by years, months;

This could be used as a workaround until this functionality is added to Derby.

Satheesh

On 5/18/06, Alex Miller < amiller@metamatrix.com> wrote:
I blogged about GROUP BY expression support across dbs a while back
( http://tech.puredanger.com/2005/03/02/fun-with-expressions-in-a-group-b
y/ and
http://tech.puredanger.com/2005/04/01/update-on-expressions-in-group-by/
).  Pretty much all the major vendors support expressions in a GROUP BY.
I implemented this functionality a while back for the MetaMatrix query
engine and blogged about using MetaMatrix Query to "add" this
functionality over Derby in
http://devcentral.metamatrix.com/blog/alex/2006/02/28/Enhancing-Apache-D
erby-with-MetaMatrix.

Positional parameters don't make as much sense for GROUP BY as they do
for ORDER BY.  Logically, you're executing the clauses in the order
FROM-WHERE-GROUP BY-HAVING-SELECT-ORDER BY.  The SELECT clause is the
point at which the output columns of the query are effectively named and
ordered.  Positional parameters make sense in ORDER BY because they
refer to the output columns of the previous phase, which are defined as
part of the query, but may have no well-defined name to refer to them
with.

In the case of GROUP BY, you would be referring backwards from the
SELECT clause to the GROUP BY clause, so that seems kind of goofy.  The
only reason to do this would be to avoid referencing a complex unnamed
expression.  This is, of course, exactly the sort of thing that db
vendors bend the rules about to make SQL more usable.  In fact, you can
use GROUP BY positional parameters in MySQL and Postgres but not in any
major commercial db that I've tried (Oracle, DB2, SQL Server, Sybase).

Alex Miller
Chief Architect
MetaMatrix

-----Original Message-----
From: Satheesh Bandaram (JIRA) [mailto:derby-dev@db.apache.org]
Sent: Wednesday, May 17, 2006 1:14 PM
To: derby-dev@db.apache.org
Subject: [jira] Commented: (DERBY-883) Enhance GROUP BY clause to
support expressions instead of just column references.

    [
http://issues.apache.org/jira/browse/DERBY-883?page=comments#action_1241
2227 ]

Satheesh Bandaram commented on DERBY-883:
-----------------------------------------

Thanks for making progress on this important new functionality, Manish.

> 1. In terms of syntax, do we allow expressions in the group by list or
positional parameters, or both?
>
> select tomonth(creationdt), toyear(creationdt), count(*) from bugs
> group by 1, 2;

I have seen positional parameters for ORDER BY expressions, not
typically used in GROUP BY. Looking at both DB2 and Oracle
documentation, it seems neither support positional parameters.

> An implementation question on this note-- does the language code have
> a way of looking at two expressions (ValueNode?) and checking to see
> if they are equivalent? We'll need some way of doing this to match an
> expression in the group by list to an expression in the select list
right?

Correct. Don't think there is any existing expression matching to
compare two expressions. DB2 docs discuss how group by expressions are
matched in SQL reference manual. (Page 484:
ftp://ftp.software.ibm.com/ps/products/db2/info/vr82/pdf/en_US/db2s1e81 .
pdf)

> 2. I assume that an expression in a group by list must appear in the
> select list without aggregation right? Is this an error?
>
> select x+1, x+2, sum(y)
> from test
> group by x

NO... This is a valid query. See the reference I provided above.

> 3. What do we do with duplicates? i.e.
>
> select x+1, x+1, sum(y)
> from test
> group by x+1, x+1;
>
> Is this an error? The current implementation throws an error if the
> same column occurs more than once in the group by list.

I am not sure why Derby currently considers this an error... Looking at
the code, it seems it may be looking for ambiguous column references
(like 'x' being part of two different tables in from_list), which makes
sense, but not sure why duplicate references should be prevented.

> Is there a standard somewhere which I should consult before trying to
nail down the functionality?

Unfortunately, NO.... SQL 2003 seems to allow only column references in
GROUP BY clause. But both DB2 and Oracle allow expressions in GROUP BY
list and likely allowed by other database vendors too. You could use
either DB2 or Oracle docs to understand how this functionality is
defined there. Much easier to read these docs than confusing SQL 2003
spec.


> Enhance GROUP BY clause to support expressions instead of just column
references.
> ----------------------------------------------------------------------
> -----------
>
>          Key: DERBY-883
>          URL: http://issues.apache.org/jira/browse/DERBY-883
>      Project: Derby
>         Type: New Feature

>   Components: SQL
>     Versions: 10.1.2.1
>  Environment: JDK 1.5.0_05
>     Reporter: Lluis Turro
>     Assignee: Manish Khettry

>
> This select would return an error syntax on finding "(" after month if
group by clause:
> select idissue, month(creation), year(creation), count(distinct
> idissue) where
>   ....
> group by idissue, month(creation), year(creation)

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira