RPost wrote:
>Your text says 'If ALL is specified then duplicates are not returned'. You
>also say that by default duplicates are removed. So is the default ALL? Or
>did you mean that if ALL is specified then duplicates are not removed?
>
>If you don't need to remove duplicates then doesn't that mean that you only
>need to sort one of the data sets? If so, this would improve performance. If
>statistics are available then the smaller dataset should be the one that is
>sorted.
>
I am sorry, my typo. My message should have read "If ALL is specified
then duplicates *are* returned". The default is to remove duplicates,
ALL is not the default.
Sorting both inputs makes the final selection easier and faster even
when duplicates are not removed. As RPost suggests, we could implement
INTERSECT ALL and EXCEPT ALL by just sorting one of the inputs, avoiding
part of the sort cost. Then we would scan through each row in the
unsorted input and use a binary search to see if it is in the sorted
input. In order to output the correct number of duplicates we must mark
the row as removed from the sorted input. I don't think that the
org.apache.derby.iapi.sql.ResultSet interface or the SortedResultSet
class currently implement row deletion or marking.
If the number of rows in the two inputs are m and n the cost of sorting
both is O(n*log(n) + m*log(m)). The final select step has a cost of
O(n+m). If we only sort one input, say the second one, then the sort
cost is O(n*log(n)) and the cost of the final select step is
O(m*log(n)), for a total cost O((m+n)*log(n)). If n < m this is
asymptotically better. The binary searches will have poor locality of
reference and deleting rows from a temporary table that has spilled to
disk may be expensive, so in practice sorting just one input may only be
better when the two inputs have significantly different sizes. Perhaps
in the future we can implement both methods and pick the better one at
optimize time.
I wanted to make an initial implementation that is simple, handles both
INTERSECT and EXCEPT, and that fits into the current Derby
implementation fairly easily. If the performance of INTERSECT or EXCEPT
proves to be important then we should revisit this.
In fact there are a number of related implementation alternatives that
may be optimal under different conditions.
1. Just sort the smaller input and use a binary search to see if each
row in the unsorted input is also in the sorted input. INTERSECT is
commutative, so it does not matter which of its inputs is sorted. EXCEPT
is not commutative. If we sort its left input and scan through the
unsorted right input we have to remove/mark the intersection rows in the
left (sorted) input and output the unremoved/marked rows at the end.
2. We can also handle duplicate elimination when just sorting one input
by removing duplicates when doing the sort, but not removing/marking
rows from the sorted input as they are found in the other input.
3. We can use a hash table instead of sorting. Unfortunately Derby does
not yet have a hash table implementation that spills to disk.
Jack
Jack Klebanoff
