I have attached an update to my previous INTERSECT/EXCEPT patch. It
addresses a number of concerns brought up in this thread.
1. It includes the fix for order by syntax checking posted by Satheesh.
2. It fixes the execution of an order by clause in the statement.
Previously an order by clause was ignored.
3. The intersect/except is still implemented by sorting its two inputs,
but optimizer is given a chance to perform sort avoidance on the inputs.
4. If there is an order by clause on the output of the intersect/except
then that ordering is used for the inputs, avoiding a separate sort on
the intersect/except output.
5. Test cases were added to cover the above code.
6. The copyright notices for the new files were changed to 2004.
7. The SetOpProjectRestrictResultSet class was renamed to
SetOpResultSet, which is more appropriate.
8. The IntersectNode and ExceptNode classes were removed and subsumed in
the IntersectOrExceptNode class.
Some of the concerns about optimization were not entirely addressed.
There is still just the one execution strategy: sort the two inputs and
scan them together. I did not implement other strategies and an
optimizer that picks the best of them. I think that my implementation
strategy performs decently in all cases, and is the best in many cases.
I don't think that it is wise to write a lot of code to optimize an
operation that is probably only used infrequently. (Cloudscape customers
have gotten along without it for all these years).
While this update allows the optimizer to avoid sorting the
intersect/except inputs it does not try to pick an ordering that is more
likely to avoid a sort. For instance, suppose one of the inputs is a
single table select that selects the columns of a unique key. Then you
only have to order that input on the key columns. Depending on the where
clause the optimizer might decide to use the unique key index to
traverse the table in sort order, avoiding a separate sort. The other
input to the intersect/except may be a different story. Those same
columns may not specify a unique key in the other input, in which case
that input must be ordered on more columns.
Unfortunately the Derby compiler architecture does not make it easy for
the IntersectOrExceptNode class to determine a column ordering that is
likely to avoid a sort on its inputs. Its inputs are represented as
ResultSetNodes. It is not easy to determine a ResultSetNode represents a
single table select, and if so whether the selected columns contain a
unique key. One would like to try avoiding a sort on the larger input,
and then try a column ordering that might avoid a sort on the smaller
input if the optimizer cannot avoid sorting the larger input.
Unfortunately this does not work: the architecture requires that the
order by clause be pushed down by the start of optimization.
It is easy to see if the output of the intersect/except must be ordered
and use this ordering to order the inputs, avoiding a separate sort on
the output of the intersect/except. I did this in the attached patch update.
Jack
Jack Klebanoff wrote:
> Attached is a patch that implements the SQL INTERSECT and EXCEPT
> operators. The INTERSECT operator constructs the intersection of two
> tables. The EXCEPT operator finds all rows in one table but not in the
> other. The syntax is (roughly):
>
> <query expression> INTERSECT [ALL] <query expression>
> <query expression> EXCEPT [ALL] <query expression>
>
> By default these operators remove duplicates, which can occur if there
> are duplicates in the inputs. If ALL is specified then duplicates are
> not returned. If t1 has m copies of row R and t2 has n copies then t1
> INTERSECT ALL t2 returns min(m,n) copies of R, and t1 EXCEPT ALL t2
> returns max( 0, m-n) copies of R.
>
> The EXCEPT operator has the same precedence as UNION. INTERSECT has
> higher precedence.
>
> This follows the SQL-92 spec. (At least it follows my understanding of
> the spec. Spec lawyers are invited to comment).
>
> The implementation uses sorting. The two input tables are sorted and
> then scanned together. The appropriate rows from the left input are
> output.
>
> The compiler binds INTERSECT and EXCEPT like UNION. Therefore a new
> class, org.apache.derby.impl.sql.compile.SetOperatorNode, was carved
> out of UnionNode. It mainly contains bind methods. Classes UnionNode
> and IntersectOrExceptNode extend SetOperatorNode. Classes
> IntersectNode and ExceptNode extend IntersectOrExceptNode.
> IntersectOrExceptNode does most of the optimization and code
> generation work. It puts OrderBy nodes in front of its inputs.
>
> The generated code creates a SetOpProjectRestrictResultSet that reads
> its sorted inputs to produce the required output table.
>
> Jack Klebanoff
>
>
|