db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "James Synge" <james.sy...@gmail.com>
Subject Re: HashJoinStrategy.feasible
Date Mon, 16 Oct 2006 18:15:08 GMT
Army, thanks for the detailed response.

I should earlier have argued here that the use of the terms
subquery and correlation in the analysis seems inappropriate.
I understand that a subquery includes its own select clause, and
that the term does not apply automatically to table expressions
(though a select statement is a table expression).

I'm perhaps being nitpicky about terminology, but it is important
(to me, at least :-)) to distinguish between standard SQL semantics
and implementation limitations/bugs.

In this case, it doesn't appear that there are any sub-queries, but instead
one cross join, two outer joins and a restriction (where clause).

Repeating for easy reference your DDL:

CREATE TABLE T (
    A INT NOT NULL,
    B DECIMAL(10,3) NOT NULL,
    C VARCHAR(5) NOT NULL);

INSERT INTO T VALUES (1, 1.0, '1'), (2, 2.0, '2'), (3, 3.0, '3');

CREATE TABLE S (
    D INT NOT NULL,
    E DECIMAL(10,3) NOT NULL,
    F VARCHAR(5) NOT NULL);

INSERT INTO S VALUES (2, 2.0, '2'), (3, 3.0, '3'), (4, 4.0, '4');

CREATE TABLE R (
    G INT NOT NULL,
    H DECIMAL(10,3) NOT NULL,
    I VARCHAR(5) NOT NULL);

INSERT INTO R VALUES (3, 3.0, '3'), (4, 4.0, '4'), (5, 5.0, '5');

CREATE TABLE TT (
    J INT NOT NULL,
    K DECIMAL(10,3) NOT NULL,
    L VARCHAR(5) NOT NULL);

INSERT INTO TT VALUES (2, 2.0, '2'), (3, 3.0, '3'), (4, 4.0, '4');

And the select statement:

select t0.*
 from
    tt as T0,
    (t as T1 left outer join
     (s as T2 left outer join r as T3 on (T2.f = T3.i))
     on (T1.b = T2.e))
 where (T0.j=T3.g);

The innermost table expression, (s as T2 left outer join r as T3 on
(T2.f = T3.i)),
produces a result with the following columns:

    t2.D INT NOT NULL
    t2.E DECIMAL(10,3) NOT NULL
    t2.F VARCHAR(5) NOT NULL
    t3.G INT
    t3.H DECIMAL(10,3)
    t3.I VARCHAR(5)

And these rows:


ij> select * from s as T2 left outer join r as T3 on (T2.f = T3.i);
D          |E           |F    |G          |H           |I
-------------------------------------------------------------
2          |2.000       |2    |NULL       |NULL        |NULL
3          |3.000       |3    |3          |3.000       |3
4          |4.000       |4    |4          |4.000       |4


An implicit constraint is that if one of G, H or I is null, they are all null.

The immediately enclosing table expression produces a result with the
following columns:

    t1.A INT NOT NULL
    t1.B DECIMAL(10,3) NOT NULL
    t1.C VARCHAR(5) NOT NULL
    t2.D INT
    t2.E DECIMAL(10,3)
    t2.F VARCHAR(5)
    t3.G INT
    t3.H DECIMAL(10,3)
    t3.I VARCHAR(5)

And these rows:


ij> select * from t as T1 left outer join (s as T2 left outer join r
as T3 on (T2.f = T3.i)) on (T1.b = T2.e);
A          |B           |C    |D          |E           |F    |G
  |H           |I
--------------------------------------------------------------------------------------------
1          |1.000       |1    |NULL       |NULL        |NULL |NULL
  |NULL        |NULL
2          |2.000       |2    |2          |2.000       |2    |NULL
  |NULL        |NULL
3          |3.000       |3    |3          |3.000       |3    |3
  |3.000       |3


An additional implicit constraint is that if one of D, E or F is null,
then all of D, E, F, G, H and I are null.

Note that there is no correlation between the left hand table
expression, "t as T1",
and the right hand table expression, "(s as T2 left outer ...".  I.e.,
there is no need
to "re-execute" the right hand table expression because is the same everytime.

Next (before the where clause) we have the cross join between the previous
result and the table tt, producing a result with the following columns:

    t0.J INT NOT NULL
    t0.K DECIMAL(10,3) NOT NULL
    t0.L VARCHAR(5) NOT NULL
    t1.A INT NOT NULL
    t1.B DECIMAL(10,3) NOT NULL
    t1.C VARCHAR(5) NOT NULL
    t2.D INT
    t2.E DECIMAL(10,3)
    t2.F VARCHAR(5)
    t3.G INT
    t3.H DECIMAL(10,3)
    t3.I VARCHAR(5)

And these rows:


ij> select * from tt as T0, (t as T1 left outer join (s as T2 left
outer join r as T3 on (T2.f = T3.i)) on (T1.b = T2.e));
J          |K           |L    |A          |B           |C    |D
  |E           |F    |G          |H           |I
---------------------------------------------------------------------------------------------------------------------------
2          |2.000       |2    |1          |1.000       |1    |NULL
  |NULL        |NULL |NULL       |NULL        |NULL
3          |3.000       |3    |1          |1.000       |1    |NULL
  |NULL        |NULL |NULL       |NULL        |NULL
4          |4.000       |4    |1          |1.000       |1    |NULL
  |NULL        |NULL |NULL       |NULL        |NULL
2          |2.000       |2    |2          |2.000       |2    |2
  |2.000       |2    |NULL       |NULL        |NULL
3          |3.000       |3    |2          |2.000       |2    |2
  |2.000       |2    |NULL       |NULL        |NULL
4          |4.000       |4    |2          |2.000       |2    |2
  |2.000       |2    |NULL       |NULL        |NULL
2          |2.000       |2    |3          |3.000       |3    |3
  |3.000       |3    |3          |3.000       |3
3          |3.000       |3    |3          |3.000       |3    |3
  |3.000       |3    |3          |3.000       |3
4          |4.000       |4    |3          |3.000       |3    |3
  |3.000       |3    |3          |3.000       |3


Applying the where clause, T0.j=T3.g, we get the following result:


ij> select * from tt as T0, (t as T1 left outer join (s as T2 left
outer join r as T3 on (T2.f = T3.i)) on (T1.b = T2.e)) where
(T0.j=T3.g);
J          |K           |L    |A          |B           |C    |D
  |E           |F    |G          |H           |I
---------------------------------------------------------------------------------------------------------------------------
3          |3.000       |3    |3          |3.000       |3    |3
  |3.000       |3    |3          |3.000       |3


Again, I believe that this is not introducing a correlation between tt
and the other table express as the results of these two table expressions
don't change based on the addition of the where clause.

One interesting thing to note here is that since T0.j=T3.g is not
true unless T3.g is NOT NULL, this clause amounts to saying that
we don't actually want any of the extra rows that the two outer joins
produce (as in those rows T3.g is always NULL).  I don't know if Derby
includes any such analysis/re-writing, nor do I know how valuable it
might be, but it is an interesting thought.

I'm left with the conclusion that Army comes to:

> ...
> If that's correct, I wonder if HashTableNode's qualifier is somehow being
> evaluated incorrectly against the materialization of JoinNode[4], leading to
> missing rows?  Unfortunately, I don't know what the answer to that question is...
>
> So long story short: theoretically it seems like hash joins between FromTables
> at different nesting levels should be possible.  But in actuality such joins can
> lead to incorrect results.  I addressed this by disallowing such hash joins as
> described in the DERBY-781 document.  But you're right, maybe this is something
> worth investigating more...?

It does indeed sound as if the predicates/qualifiers are being misapplied.
My focus for now is on learning enough to address Derby-47, so I'm not able
to dive in to this.

James

Mime
View raw message