db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bryan Pendleton (JIRA)" <j...@apache.org>
Subject [jira] Commented: (DERBY-3880) NPE on a query with having clause involving a join
Date Tue, 11 Nov 2008 17:19:44 GMT

    [ https://issues.apache.org/jira/browse/DERBY-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646605#action_12646605

Bryan Pendleton commented on DERBY-3880:

Hi Army, Kathey, sorry it took me a while to read through Army's excellent description of

I have a couple questions that came to mind while I was studying Army's notes:

1) Army says  "so the JoinNode's bind-time RCL is: { T1.i, T1.c, T2.i, T2.c2, T2.i2 }" but
I don't
see how that is, if the query is 

      select t1.i, avg(t2.i)
          from t1 inner join t2 on (t1.i = t2.i)
          group by t1.i having avg(t2.i) > 0; 

Is it possible that the query Army was analyzing was actually "on (t1.c = t2.c2)" ? Otherwise,
I'm having trouble seeing why T1.c and T2.c2 are in the join node's RCL. 

2) While I was reading Army's writeup, I anticipated that I was going to see a fix which
involved passing the HAVING clause expressions to fromList.flattenFromTables() so that
it could properly maintain the ColumnReference data in the HAVING clause during flattening.

But it seems like the essence of Army's fix is to observe that we don't need to maintain
the ColumnReference instances inside of AggregateNode instances because we can
always go back at the end of optimization and re-construct the ColumnReference info.

This seems very nice, because I've seen numerous problems over the past several
years which involved situations where we lost track of the ColumnReference's table
and column pointers during the compilation and optimization processing. That code
overall seems very tricky and trouble-prone. I could track down the actual JIRA id's
if you wish; they involved GROUP BY and ORDER BY analysis, which is similar, but
certainly not identical, to HAVING clause analysis.

So now I'm wondering: if it's true that we can figure out the proper ColumnReference
information at the *end* of the optimization process, and don't have to maintain
that data during the compilation/optimzation process, then perhaps we shouldn't be
doing it at all, but rather should simply wait til optimization is complete, then fix-up all
the ColumnReferences in the tree at the end.

That is, if it's safe to allow the ColumnReference in the aggregate of a HAVING clause
to "decay" (become in-accurate) during optimization, and then fix it up at the end,
why do we have to re-map the ColumnReference information for the SELECT's RCL?

And similarly could we avoid having to "fix-up" the GROUP BY and ORDER BY clauses?

Or is it only safe to allow HAVING clause ColumnReference instances to diverge
from reality for short periods, because they are fundamentally different from the
aggregate nodes in the select RCL?

3) Army says "the compile time query tree has two AggregateNodes". I've noticed
this myself. In fact, while looking at somewhat similar situations, I've noticed that
(I think) the actual code that is generated to execute the query may contain
multiple computations of the same expressions. That is, when this query runs,
is it true that we are actually going to *compute* AVG(t2.i) twice? We'll only do
one pass over the data, of course, but we'll have two running-total values that
are redundantly accumulating the t2.i sub-totals as we go?

I've been wondering if maybe there is some sort of fundamental "common sub-expression
detection" which the Derby compilation system is not performing, such that it
doesn't recognize when two parts of a query are actually computing the same
result, and so we end up generating a query which performs more computation
than is needed, redundantly computing values multiple times in situations where
they are mentioned multiple times.

I noticed this in studying GROUP BY queries, where if the GROUP BY clause
contained an expression, and that expression matched a similar expression
in the SELECT list, the resulting query seemed to compute the expression twice,
when it would have been adequate to compute it only once.

I'm not intending to "hijack the thread" with any of these questions; I'm extremely
happy that we've solved this problem and that the query now runs correctly.
I'm just trying to improve my understanding of Derby internals by taking this
opportunity to ask a few questions about parts of this issue that puzzled me.

Thanks VERY much to Army and Kathey for all the hard work they did on this problem!


> NPE on a query with having clause involving a join
> --------------------------------------------------
>                 Key: DERBY-3880
>                 URL: https://issues.apache.org/jira/browse/DERBY-3880
>             Project: Derby
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions:
>         Environment: Windows 2003 Server 
>            Reporter: Venkateswaran Iyer
>            Assignee: A B
>            Priority: Minor
>             Fix For:,,
>         Attachments: AggregateExpressionResultColumn.txt, querytree_fail.txt, querytree_works.txt
> A simple query involving a join and having clause causes a NPE. Any subsequent executions
cause severe errors. It almost looks like the underlying connection was closed out.
> ====
> C:\apps\derby\db-derby-\db-derby-\bin>ij
> ij version 10.4
> ij> connect 'jdbc:derby://speed:1527/ClassicModels;user=sa;password=sa
> ';
> ij> create table t1(i int, c varchar(20));
> 0 rows inserted/updated/deleted
> ij> create table t2(i int, c2 varchar(20), i2 int);
> 0 rows inserted/updated/deleted
> ij> insert into t1 values(1, 'abc');
> 1 row inserted/updated/deleted
> ij> insert into t1 values(2, 'abc');
> 1 row inserted/updated/deleted
> ij> insert into t2 values(1, 'xyz', 10);
> 1 row inserted/updated/deleted
> ij> insert into t2 values(1, 'aaa', 20);
> 1 row inserted/updated/deleted
> ij> insert into t2 values(2, 'xxx', 30);
> 1 row inserted/updated/deleted
> ij> select t1.i, avg(t2.i2) from t1 inner join t2 on (t1.i = t2.i) group by t1.i
>  having avg(t2.i2) > 0;
> ERROR XJ001: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ001, SQLERRMC: java.lang.
> NullPointerException¶¶XJ001.U

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message