db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jack Klebanoff <kleba...@Mutagen.Net>
Subject Re: [PATCH] Intersect and Except
Date Wed, 15 Dec 2004 03:50:36 GMT
I have attached an update to my previous INTERSECT/EXCEPT patch. It 
addresses a number of concerns brought up in this thread.

1. It includes the fix for order by syntax checking posted by Satheesh.
2. It fixes the execution of an order by clause in the statement. 
Previously an order by clause was ignored.
3. The intersect/except is still implemented by sorting its two inputs, 
but optimizer is given a chance to perform sort avoidance on the inputs.
4. If there is an order by clause on the output of the intersect/except 
then that ordering is used for the inputs, avoiding a separate sort on 
the intersect/except output.
5. Test cases were added to cover the above code.
6. The copyright notices for the new files were changed to 2004.
7. The SetOpProjectRestrictResultSet class was renamed to 
SetOpResultSet, which is more appropriate.
8. The IntersectNode and ExceptNode classes were removed and subsumed in 
the IntersectOrExceptNode class.

Some of the concerns about optimization were not entirely addressed.

There is still just the one execution strategy: sort the two inputs and 
scan them together. I did not implement other strategies and an 
optimizer that picks the best of them. I think that my implementation 
strategy performs decently in all cases, and is the best in many cases. 
I don't think that it is wise to write a lot of code to optimize an 
operation that is probably only used infrequently. (Cloudscape customers 
have gotten along without it for all these years).

While this update allows the optimizer to avoid sorting the 
intersect/except inputs it does not try to pick an ordering that is more 
likely to avoid a sort. For instance, suppose one of the inputs is a 
single table select that selects the columns of a unique key. Then you 
only have to order that input on the key columns. Depending on the where 
clause the optimizer might decide to use the unique key index to 
traverse the table in sort order, avoiding a separate sort. The other 
input to the intersect/except may be a different story. Those same 
columns may not specify a unique key in the other input, in which case 
that input must be ordered on more columns.

Unfortunately the Derby compiler architecture does not make it easy for 
the IntersectOrExceptNode class to determine a column ordering that is 
likely to avoid a sort on its inputs. Its inputs are represented as 
ResultSetNodes. It is not easy to determine a ResultSetNode represents a 
single table select, and if so whether the selected columns contain a 
unique key. One would like to try avoiding a sort on the larger input, 
and then try a column ordering that might avoid a sort on the smaller 
input if the optimizer cannot avoid sorting the larger input. 
Unfortunately this does not work: the architecture requires that the 
order by clause be pushed down by the start of optimization.

It is easy to see if the output of the intersect/except must be ordered 
and use this ordering to order the inputs, avoiding a separate sort on 
the output of the intersect/except. I did this in the attached patch update.

Jack

Jack Klebanoff wrote:

> Attached is a patch that implements the SQL INTERSECT and EXCEPT 
> operators. The INTERSECT operator constructs the intersection of two 
> tables. The EXCEPT operator finds all rows in one table but not in the 
> other. The syntax is (roughly):
>
> <query expression> INTERSECT [ALL] <query expression>
> <query expression> EXCEPT [ALL] <query expression>
>
> By default these operators remove duplicates, which can occur if there 
> are duplicates in the inputs. If ALL is specified then duplicates are 
> not returned. If t1 has m copies of row R and t2 has n copies then t1 
> INTERSECT ALL t2 returns min(m,n) copies of R, and t1 EXCEPT ALL t2 
> returns max( 0, m-n) copies of R.
>
> The EXCEPT operator has the same precedence as UNION. INTERSECT has 
> higher precedence.
>
> This follows the SQL-92 spec. (At least it follows my understanding of 
> the spec. Spec lawyers are invited to comment).
>
> The implementation uses sorting. The two input tables are sorted and 
> then scanned together. The appropriate rows from the left input are 
> output.
>
> The compiler binds INTERSECT and EXCEPT like UNION. Therefore a new 
> class, org.apache.derby.impl.sql.compile.SetOperatorNode, was carved 
> out of UnionNode. It mainly contains bind methods. Classes UnionNode 
> and IntersectOrExceptNode extend SetOperatorNode. Classes 
> IntersectNode and ExceptNode extend IntersectOrExceptNode. 
> IntersectOrExceptNode does most of the optimization and code 
> generation work. It puts OrderBy nodes in front of its inputs.
>
> The generated code creates a SetOpProjectRestrictResultSet that reads 
> its sorted inputs to produce the required output table.
>
> Jack Klebanoff
>
>


Mime
View raw message