Return-Path: Delivered-To: apmail-db-derby-dev-archive@www.apache.org Received: (qmail 60811 invoked from network); 16 Oct 2006 18:15:34 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur.apache.org with SMTP; 16 Oct 2006 18:15:34 -0000 Received: (qmail 83418 invoked by uid 500); 16 Oct 2006 18:15:33 -0000 Delivered-To: apmail-db-derby-dev-archive@db.apache.org Received: (qmail 83372 invoked by uid 500); 16 Oct 2006 18:15:33 -0000 Mailing-List: contact derby-dev-help@db.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: Delivered-To: mailing list derby-dev@db.apache.org Received: (qmail 83361 invoked by uid 99); 16 Oct 2006 18:15:33 -0000 Received: from asf.osuosl.org (HELO asf.osuosl.org) (140.211.166.49) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 16 Oct 2006 11:15:32 -0700 X-ASF-Spam-Status: No, hits=0.5 required=10.0 tests=DNS_FROM_RFC_ABUSE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (asf.osuosl.org: domain of james.synge@gmail.com designates 64.233.182.190 as permitted sender) Received: from [64.233.182.190] (HELO nf-out-0910.google.com) (64.233.182.190) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 16 Oct 2006 11:15:30 -0700 Received: by nf-out-0910.google.com with SMTP id p77so1968949nfc for ; Mon, 16 Oct 2006 11:15:08 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:reply-to:to:subject:mime-version:content-type:content-transfer-encoding:content-disposition; b=hSLfObQ9Nsi+eYWmclTOFvqsQKp9qakgEUHBuJLCXdafBR0ltvd94LIJmC4MkN7XGcQRoT6KwY5vVyoVuekBuzTCx+y/D9TSotiEKdnBM9bPIn7qOeP7dTUwX2UG2HKs9Q3sr9+RWY2lICRRxO2A3ICU8um6pO9w3/4ZeV5OWvs= Received: by 10.49.75.2 with SMTP id c2mr12653288nfl; Mon, 16 Oct 2006 11:15:08 -0700 (PDT) Received: by 10.49.90.1 with HTTP; Mon, 16 Oct 2006 11:15:08 -0700 (PDT) Message-ID: Date: Mon, 16 Oct 2006 14:15:08 -0400 From: "James Synge" Reply-To: james.synge@acm.org To: derby-dev@db.apache.org Subject: Re: HashJoinStrategy.feasible MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-Virus-Checked: Checked by ClamAV on apache.org X-Spam-Rating: minotaur.apache.org 1.6.2 0/1000/N Army, thanks for the detailed response. I should earlier have argued here that the use of the terms subquery and correlation in the analysis seems inappropriate. I understand that a subquery includes its own select clause, and that the term does not apply automatically to table expressions (though a select statement is a table expression). I'm perhaps being nitpicky about terminology, but it is important (to me, at least :-)) to distinguish between standard SQL semantics and implementation limitations/bugs. In this case, it doesn't appear that there are any sub-queries, but instead one cross join, two outer joins and a restriction (where clause). Repeating for easy reference your DDL: CREATE TABLE T ( A INT NOT NULL, B DECIMAL(10,3) NOT NULL, C VARCHAR(5) NOT NULL); INSERT INTO T VALUES (1, 1.0, '1'), (2, 2.0, '2'), (3, 3.0, '3'); CREATE TABLE S ( D INT NOT NULL, E DECIMAL(10,3) NOT NULL, F VARCHAR(5) NOT NULL); INSERT INTO S VALUES (2, 2.0, '2'), (3, 3.0, '3'), (4, 4.0, '4'); CREATE TABLE R ( G INT NOT NULL, H DECIMAL(10,3) NOT NULL, I VARCHAR(5) NOT NULL); INSERT INTO R VALUES (3, 3.0, '3'), (4, 4.0, '4'), (5, 5.0, '5'); CREATE TABLE TT ( J INT NOT NULL, K DECIMAL(10,3) NOT NULL, L VARCHAR(5) NOT NULL); INSERT INTO TT VALUES (2, 2.0, '2'), (3, 3.0, '3'), (4, 4.0, '4'); And the select statement: select t0.* from tt as T0, (t as T1 left outer join (s as T2 left outer join r as T3 on (T2.f = T3.i)) on (T1.b = T2.e)) where (T0.j=T3.g); The innermost table expression, (s as T2 left outer join r as T3 on (T2.f = T3.i)), produces a result with the following columns: t2.D INT NOT NULL t2.E DECIMAL(10,3) NOT NULL t2.F VARCHAR(5) NOT NULL t3.G INT t3.H DECIMAL(10,3) t3.I VARCHAR(5) And these rows: ij> select * from s as T2 left outer join r as T3 on (T2.f = T3.i); D |E |F |G |H |I ------------------------------------------------------------- 2 |2.000 |2 |NULL |NULL |NULL 3 |3.000 |3 |3 |3.000 |3 4 |4.000 |4 |4 |4.000 |4 An implicit constraint is that if one of G, H or I is null, they are all null. The immediately enclosing table expression produces a result with the following columns: t1.A INT NOT NULL t1.B DECIMAL(10,3) NOT NULL t1.C VARCHAR(5) NOT NULL t2.D INT t2.E DECIMAL(10,3) t2.F VARCHAR(5) t3.G INT t3.H DECIMAL(10,3) t3.I VARCHAR(5) And these rows: ij> select * from t as T1 left outer join (s as T2 left outer join r as T3 on (T2.f = T3.i)) on (T1.b = T2.e); A |B |C |D |E |F |G |H |I -------------------------------------------------------------------------------------------- 1 |1.000 |1 |NULL |NULL |NULL |NULL |NULL |NULL 2 |2.000 |2 |2 |2.000 |2 |NULL |NULL |NULL 3 |3.000 |3 |3 |3.000 |3 |3 |3.000 |3 An additional implicit constraint is that if one of D, E or F is null, then all of D, E, F, G, H and I are null. Note that there is no correlation between the left hand table expression, "t as T1", and the right hand table expression, "(s as T2 left outer ...". I.e., there is no need to "re-execute" the right hand table expression because is the same everytime. Next (before the where clause) we have the cross join between the previous result and the table tt, producing a result with the following columns: t0.J INT NOT NULL t0.K DECIMAL(10,3) NOT NULL t0.L VARCHAR(5) NOT NULL t1.A INT NOT NULL t1.B DECIMAL(10,3) NOT NULL t1.C VARCHAR(5) NOT NULL t2.D INT t2.E DECIMAL(10,3) t2.F VARCHAR(5) t3.G INT t3.H DECIMAL(10,3) t3.I VARCHAR(5) And these rows: ij> select * from tt as T0, (t as T1 left outer join (s as T2 left outer join r as T3 on (T2.f = T3.i)) on (T1.b = T2.e)); J |K |L |A |B |C |D |E |F |G |H |I --------------------------------------------------------------------------------------------------------------------------- 2 |2.000 |2 |1 |1.000 |1 |NULL |NULL |NULL |NULL |NULL |NULL 3 |3.000 |3 |1 |1.000 |1 |NULL |NULL |NULL |NULL |NULL |NULL 4 |4.000 |4 |1 |1.000 |1 |NULL |NULL |NULL |NULL |NULL |NULL 2 |2.000 |2 |2 |2.000 |2 |2 |2.000 |2 |NULL |NULL |NULL 3 |3.000 |3 |2 |2.000 |2 |2 |2.000 |2 |NULL |NULL |NULL 4 |4.000 |4 |2 |2.000 |2 |2 |2.000 |2 |NULL |NULL |NULL 2 |2.000 |2 |3 |3.000 |3 |3 |3.000 |3 |3 |3.000 |3 3 |3.000 |3 |3 |3.000 |3 |3 |3.000 |3 |3 |3.000 |3 4 |4.000 |4 |3 |3.000 |3 |3 |3.000 |3 |3 |3.000 |3 Applying the where clause, T0.j=T3.g, we get the following result: ij> select * from tt as T0, (t as T1 left outer join (s as T2 left outer join r as T3 on (T2.f = T3.i)) on (T1.b = T2.e)) where (T0.j=T3.g); J |K |L |A |B |C |D |E |F |G |H |I --------------------------------------------------------------------------------------------------------------------------- 3 |3.000 |3 |3 |3.000 |3 |3 |3.000 |3 |3 |3.000 |3 Again, I believe that this is not introducing a correlation between tt and the other table express as the results of these two table expressions don't change based on the addition of the where clause. One interesting thing to note here is that since T0.j=T3.g is not true unless T3.g is NOT NULL, this clause amounts to saying that we don't actually want any of the extra rows that the two outer joins produce (as in those rows T3.g is always NULL). I don't know if Derby includes any such analysis/re-writing, nor do I know how valuable it might be, but it is an interesting thought. I'm left with the conclusion that Army comes to: > ... > If that's correct, I wonder if HashTableNode's qualifier is somehow being > evaluated incorrectly against the materialization of JoinNode[4], leading to > missing rows? Unfortunately, I don't know what the answer to that question is... > > So long story short: theoretically it seems like hash joins between FromTables > at different nesting levels should be possible. But in actuality such joins can > lead to incorrect results. I addressed this by disallowing such hash joins as > described in the DERBY-781 document. But you're right, maybe this is something > worth investigating more...? It does indeed sound as if the predicates/qualifiers are being misapplied. My focus for now is on learning enough to address Derby-47, so I'm not able to dive in to this. James