db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Miller" <amil...@metamatrix.com>
Subject RE: [jira] Commented: (DERBY-883) Enhance GROUP BY clause to support expressions instead of just column references.
Date Mon, 22 May 2006 03:52:58 GMT
I'm not familiar with the Derby query engine, so I can't comment on the
specifics.  I know that when I implemented the equivalent enhancement in
the MetaMatrix query engine that it was fairly straightforward to add
expression capability while grouping rather than rewrite as a subquery.
The main thing was to handle expression matching in the HAVING selection
and SELECT projection above the grouping to detect columns that were
outputs from the GROUP BY and not expressions to be evaluated.  
 
We also have done some interesting optimizations with partial aggregate
pushdowns, particularly in multi-database queries that rely pretty
heavily on having grouping expressions, so this may open up some new
optimization options for you as well.  Partial aggregate pushdown can
actually push an aggregate over a join through the join and do a partial
pre-aggregation below the join.  In Derby, it's hard to say whether this
is a big deal but for us it's crucial in making aggregates over
multi-database joins efficient.  
 
An example is something like:
 
SELECT SUM(o.O_TOTALPRICE ), n.N_NAME 
FROM db_a.ORDERS o, db_b.CUSTOMER c, db_b.NATION n 
WHERE (c.C_CUSTKEY=o.O_CUSTKEY) AND  (n.N_NATIONKEY=c.C_NATIONKEY) 
GROUP BY n.N_NAME 
 
which without partial aggregation, would cause source queries like:
 
db_a: SELECT O_CUSTKEY, O_TOTALPRICE FROM ORDERS
db_b: SELECT C_CUSTKEY, C_NATIONKEY, N_NAME FROM CUSTOMER, NATION
 
and the join and aggregation must occur inside the engine (which
requires reading all of these 3 tables from the source).  Bad stuff.
 
With partial aggregation, we're able to push the aggregate partially
through the join and aggregate on customer in db_a, then re-aggregate on
nation in the engine:
 
db_a: SELECT O_CUSTKEY, SUM(O_TOTALPRICE) FROM ORDERS GROUP BY O_CUSTKEY

db_b: same
 
This significantly reduces the data retrieved from the ORDERS table (the
biggest), by whatever the fanout is from customer to order.  This is
still not necessarily that compelling an example, but there are many
other common queries where the difference is even more significant.
Turns out that the tricky thing is that sometimes partial aggregation is
*worse* if the additional aggregation is effectively not reducing the
fanout at all (or much).  In that case, you still retrieve just as much
data but you do aggregation in both the db and the engine, so it ends up
being slower than the original.  This is pretty easily detected with
some minimal costing information.  Of course, there are also limitations
on when this can be used but they still cover a good portion of the real
set of queries.
 
Alex Miller
Chief Architect
MetaMatrix

________________________________

From: Manish Khettry [mailto:manish.khettry@gmail.com] 
Sent: Saturday, May 20, 2006 6:17 PM
To: derby-dev@db.apache.org
Subject: Re: [jira] Commented: (DERBY-883) Enhance GROUP BY clause to
support expressions instead of just column references.


Thanks Alex and Satheesh, that does clarify things quite a bit. 

I spent some time looking at the GroupedAggregateResultSet and I have
the feeling that atleast on the execution side of things not much will
have to change. A grouped aggregate RS already sits on top of a
ProjectRestrictRS and if we can have the prRS evaluate the expressions
we are grouping on, the existing logic should work without any (or
perhaps minimal) changes. It is on the compilation side of things that
this is, err rather hairy :) Someone should correct me if I'm missing
something here or simplifying things. 

The other approach is to rewrite the query to use a select subquery like
Satheesh suggested. Perhaps this is a better way to go. Any thoughts on
which approach is better?



On 5/18/06, Alex Miller <amiller@metamatrix.com> wrote: 

	No, we don't go quite that far.  Because most dbs support group
by expressions, we haven't needed to implement this.  

________________________________

	From: Satheesh Bandaram [mailto:bandaram@gmail.com] 
	Sent: Thursday, May 18, 2006 2:16 PM
	
	To: derby-dev@db.apache.org
	
	Subject: Re: [jira] Commented: (DERBY-883) Enhance GROUP BY
clause to support expressions instead of just column references.
	
	
	
	Thanks Alex, for your message and the blogs... It was an
interesting reading. Wonder if MetaMatrix rewrites queries with GROUP BY
expressions for Derby into SELECT subqueries... something like:
	
	select year(hiredate), month(hiredate) as month, count(*)  from
employees 
	group by year(hiredate), month(hiredate);
	
	INTO something like:
	
	select years, months, count(*) from (select year(hiredate),
month(hiredate) from employees) emp (years,months) group by years,
months;
	
	This could be used as a workaround until this functionality is
added to Derby.
	
	Satheesh
	
	
	On 5/18/06, Alex Miller < amiller@metamatrix.com
<mailto:amiller@metamatrix.com> > wrote: 

		I blogged about GROUP BY expression support across dbs a
while back 
		(
http://tech.puredanger.com/2005/03/02/fun-with-expressions-in-a-group-b
<http://tech.puredanger.com/2005/03/02/fun-with-expressions-in-a-group-b
> 
		y/ and
	
http://tech.puredanger.com/2005/04/01/update-on-expressions-in-group-by/

		).  Pretty much all the major vendors support
expressions in a GROUP BY.
		I implemented this functionality a while back for the
MetaMatrix query 
		engine and blogged about using MetaMatrix Query to "add"
this
		functionality over Derby in
	
http://devcentral.metamatrix.com/blog/alex/2006/02/28/Enhancing-Apache-D

		erby-with-MetaMatrix.
		
		Positional parameters don't make as much sense for GROUP
BY as they do
		for ORDER BY.  Logically, you're executing the clauses
in the order
		FROM-WHERE-GROUP BY-HAVING-SELECT-ORDER BY.  The SELECT
clause is the 
		point at which the output columns of the query are
effectively named and
		ordered.  Positional parameters make sense in ORDER BY
because they
		refer to the output columns of the previous phase, which
are defined as 
		part of the query, but may have no well-defined name to
refer to them
		with.
		
		In the case of GROUP BY, you would be referring
backwards from the
		SELECT clause to the GROUP BY clause, so that seems kind
of goofy.  The 
		only reason to do this would be to avoid referencing a
complex unnamed
		expression.  This is, of course, exactly the sort of
thing that db
		vendors bend the rules about to make SQL more usable.
In fact, you can
		use GROUP BY positional parameters in MySQL and Postgres
but not in any
		major commercial db that I've tried (Oracle, DB2, SQL
Server, Sybase).
		
		Alex Miller
		Chief Architect
		MetaMatrix
		
		-----Original Message----- 
		From: Satheesh Bandaram (JIRA)
[mailto:derby-dev@db.apache.org]
		Sent: Wednesday, May 17, 2006 1:14 PM
		To: derby-dev@db.apache.org 
		Subject: [jira] Commented: (DERBY-883) Enhance GROUP BY
clause to
		support expressions instead of just column references.
		
		    [
	
http://issues.apache.org/jira/browse/DERBY-883?page=comments#action_1241

		2227 ]
		
		Satheesh Bandaram commented on DERBY-883:
		-----------------------------------------
		
		Thanks for making progress on this important new
functionality, Manish. 
		
		> 1. In terms of syntax, do we allow expressions in the
group by list or
		positional parameters, or both?
		>
		> select tomonth(creationdt), toyear(creationdt),
count(*) from bugs
		> group by 1, 2; 
		
		I have seen positional parameters for ORDER BY
expressions, not
		typically used in GROUP BY. Looking at both DB2 and
Oracle
		documentation, it seems neither support positional
parameters.
		
		> An implementation question on this note-- does the
language code have 
		> a way of looking at two expressions (ValueNode?) and
checking to see
		> if they are equivalent? We'll need some way of doing
this to match an
		> expression in the group by list to an expression in
the select list 
		right?
		
		Correct. Don't think there is any existing expression
matching to
		compare two expressions. DB2 docs discuss how group by
expressions are
		matched in SQL reference manual. (Page 484:
	
ftp://ftp.software.ibm.com/ps/products/db2/info/vr82/pdf/en_US/db2s1e81
.
		pdf)
		
		> 2. I assume that an expression in a group by list must
appear in the
		> select list without aggregation right? Is this an
error? 
		>
		> select x+1, x+2, sum(y)
		> from test
		> group by x
		
		NO... This is a valid query. See the reference I
provided above.
		
		> 3. What do we do with duplicates? i.e.
		>
		> select x+1, x+1, sum(y) 
		> from test
		> group by x+1, x+1;
		>
		> Is this an error? The current implementation throws an
error if the
		> same column occurs more than once in the group by
list.
		
		I am not sure why Derby currently considers this an
error... Looking at 
		the code, it seems it may be looking for ambiguous
column references
		(like 'x' being part of two different tables in
from_list), which makes
		sense, but not sure why duplicate references should be
prevented.
		
		> Is there a standard somewhere which I should consult
before trying to
		nail down the functionality?
		
		Unfortunately, NO.... SQL 2003 seems to allow only
column references in
		GROUP BY clause. But both DB2 and Oracle allow
expressions in GROUP BY 
		list and likely allowed by other database vendors too.
You could use
		either DB2 or Oracle docs to understand how this
functionality is
		defined there. Much easier to read these docs than
confusing SQL 2003
		spec. 
		
		
		> Enhance GROUP BY clause to support expressions instead
of just column
		references.
		>
----------------------------------------------------------------------
		> -----------
		>
		>          Key: DERBY-883 
		>          URL:
http://issues.apache.org/jira/browse/DERBY-883
		>      Project: Derby
		>         Type: New Feature
		
		>   Components: SQL 
		>     Versions: 10.1.2.1
		>  Environment: JDK 1.5.0_05
		>     Reporter: Lluis Turro
		>     Assignee: Manish Khettry
		
		>
		> This select would return an error syntax on finding
"(" after month if 
		group by clause:
		> select idissue, month(creation), year(creation),
count(distinct
		> idissue) where
		>   ....
		> group by idissue, month(creation), year(creation)
		
		--
		This message is automatically generated by JIRA. 
		-
		If you think it was sent incorrectly contact one of the
administrators:
	
http://issues.apache.org/jira/secure/Administrators.jspa 
		-
		For more information on JIRA, see:
		   http://www.atlassian.com/software/jira
		
		




Mime
View raw message