db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Army <qoz...@gmail.com>
Subject Re: HashJoinStrategy.feasible
Date Wed, 11 Oct 2006 22:33:00 GMT
James Synge wrote:
> Army mentioned earlier the analysis he wrote for Derby-781 (Subquery
> materialization Hash Join).  I've read most of that, 

Thanks for reading the DERBY-781 writeup. It's always nice to know other people 
are double-checking the analysis and asking questions--especially with the 
optimizer since, as you probably know by now, there's an endless supply of 
analysis to do and questions to ask...

> and am confused by his analysis of this select:
> 
> Q2. select 1 from tt, (t left outer join (s left outer join r on (f =
> i)) on (b = e)) where (j=g)
> 
> Judging from the analysis, this query could be less ambiguously 
> expressed as:
> 
> select 1
> from
>    tt as T0,
>    (t as T1 left outer join
>     (s as T2 left outer join r as T3 on (T2.f = T3.i))
>     on (T1.b = T2.e))
> where (T0.j=T3.g)
> 
> In particular, I don't understand the claim that the subqueries can
> not be materialized because of the fact that the (j=g) and (b=e) 
> predicates create a correlation between the levels.  

I'm glad you mentioned the predicate (b=e).  In the discussion I focused on 
(j=g) but as it turns out, it's actually (b=e) that makes the difference in this 
particular query.  The logic for both queries is the same (w.r.t to the 
different nesting levels) but for this specific scenario, (b=e) is the predicate 
of importance.

> I can't see any theoretical reason why the following table expression
> couldn't be materialized:
> 
>    s as T2 left outer join r as T3 on (T2.f = T3.i)

This expression in and of itself can in fact be materialized.  In this 
*isolated* query the predicate is between two tables that are at the same 
nesting level--namely, T2 and T3. So if all we look at is (T2.f=T3.i), then yes, 
as you say, there is no correlation between this expression and the rest of the 
FROM clause, and thus this piece can be safely materialized.

But in the bigger query, (T2.f=T3.i) is not the only predicate.  We also have 
(T0.j=T3.g) and (T1.b=T2.e).  And as I mentioned above, it's the predicate 
(T1.b=T2.e) which, for this example, renders materialization unsafe.

> Furthermore, there is no correlation between:
> 
>    (t as T1 left outer join (s as T2 left outer join r as T3 on (T2.f
> = T3.i)) on (T1.b = T2.e))
> 
> And the rest of the from clause (just TT in this case).

The correlation comes via the predicate (T0.j=T3.g), which is specified as part 
of the WHERE clause in the top-level query.  That predicate joins TT with the 
HalfOuterJoinNode, hence the correlation.  (Is that what you were asking?)

> My understanding of the semantics of select are that (logically) the
> FROM clause is executed first to produce a working table (in this 
> case a CROSS JOIN of TT with the result of the left outer joins, 
> which are evaluated independently of TT).

Logically, I think you are right here.  Put differently, I think what you're 
expecting is that the HalfOuterJoin node should be evaluated to some result set, 
and then the hash join can occur between TT and that result set using 
(T0.j=T3.g) as the equijoin predicate.

Or, in the case of (T1.b=T2.e), JoinNode[4] should be evaluated to some result 
set and then the hash join can occur between T1 and that JoinNode.

I agree that logically it seems like this should work.  But alas, it seems like 
something under the covers does not follow this logical separation of hash join 
vs. result-set-creation.  Take, for example, the following tables and data, and 
then run the less ambiguous version of your query:

CREATE TABLE T (A INT NOT NULL, B DECIMAL(10,3) NOT NULL, C VARCHAR(5) NOT NULL);
INSERT INTO T VALUES (1, 1.0, '1'), (2, 2.0, '2'), (3, 3.0, '3');

CREATE TABLE S (D INT NOT NULL, E DECIMAL(10,3) NOT NULL, F VARCHAR(5) NOT NULL);
INSERT INTO S VALUES (2, 2.0, '2'), (3, 3.0, '3'), (4, 4.0, '4');

CREATE TABLE R (G INT NOT NULL, H DECIMAL(10,3) NOT NULL, I VARCHAR(5) NOT NULL);
INSERT INTO R VALUES (3, 3.0, '3'), (4, 4.0, '4'), (5, 5.0, '5');

CREATE TABLE TT (J INT NOT NULL, K DECIMAL(10,3) NOT NULL, L VARCHAR(5) NOT NULL);
INSERT INTO TT VALUES (2, 2.0, '2'), (3, 3.0, '3'), (4, 4.0, '4');

Currently Derby will (correctly) return one row:

ij> select t0.*
  from
     tt as T0,
     (t as T1 left outer join
      (s as T2 left outer join r as T3 on (T2.f = T3.i))
      on (T1.b = T2.e))
  where (T0.j=T3.g);
J          |K           |L
------------------------------
3          |3.000       |3

But without the changes for DERBY-781 that led to this discussion, the query 
returns zero rows.  The difference between the two plans is that, in the latter, 
the optimizer does a hash join between T1 and JoinNode[4], but in the former it 
does not.

So given that I came to the conclusion that hash joins are unsafe if the 
predicate references FromTables at different nesting levels.  And I think that 
conclusion is still correct based on the Derby behavior.

I cannot, however, say for certain *why* such a hash join is unsafe.  I think 
you're right in asking this question and I'm glad you did...but I don't have a 
good answer.  Perhaps something is in fact amiss somewhere?

My thinking (fwiw): the in-memory hash table for non-base-tables is represented 
by a HashTableNode that is created in the "modifyAccessPaths()" method of 
ProjectRestrictNode.  At generation time the join predicate is generated as a 
qualifier for the HashTableNode's child--i.e. for JoinNode[4] in our example. 
(See the generateMinion() method in HashTableNode).  My understanding is that, 
at execution time, the JoinNode[4] result set is materialized and probed with 
the first row in T1.  Then for subsequent rows in T1, the materialized result 
set is simply probed with the appropriate column value from the T1 row.

If that's correct, I wonder if HashTableNode's qualifier is somehow being 
evaluated incorrectly against the materialization of JoinNode[4], leading to 
missing rows?  Unfortunately, I don't know what the answer to that question is...

So long story short: theoretically it seems like hash joins between FromTables 
at different nesting levels should be possible.  But in actuality such joins can 
lead to incorrect results.  I addressed this by disallowing such hash joins as 
described in the DERBY-781 document.  But you're right, maybe this is something 
worth investigating more...?

Army


Mime
View raw message