db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jack Klebanoff <kleba...@Mutagen.Net>
Subject Re: [PATCH] Intersect and Except
Date Fri, 10 Dec 2004 04:57:38 GMT
RPost wrote:

>Your text says 'If ALL is specified then duplicates are not returned'. You
>also say that by default duplicates are removed. So is the default ALL? Or
>did you mean that if ALL is specified then duplicates are not removed?
>
>If you don't need to remove duplicates then doesn't that mean that you only
>need to sort one of the data sets? If so, this would improve performance. If
>statistics are available then the smaller dataset should be the one that is
>sorted.
>
I am sorry, my typo. My message should have read "If ALL is specified 
then duplicates *are* returned". The default is to remove duplicates, 
ALL is not the default.

Sorting both inputs makes the final selection easier and faster even 
when duplicates are not removed. As RPost suggests, we could implement 
INTERSECT ALL and EXCEPT ALL by just sorting one of the inputs, avoiding 
part of the sort cost. Then we would scan through each row in the 
unsorted input and use a binary search to see if it is in the sorted 
input. In order to output the correct number of duplicates we must mark 
the row as removed from the sorted input. I don't think that the 
org.apache.derby.iapi.sql.ResultSet interface or the SortedResultSet 
class currently implement row deletion or marking.

If the number of rows in the two inputs are m and n the cost of sorting 
both is O(n*log(n) + m*log(m)). The final select step has a cost of  
O(n+m). If we only sort one input, say the second one, then the sort 
cost is O(n*log(n)) and the cost of the final select step is 
O(m*log(n)), for a total cost O((m+n)*log(n)). If n < m this is 
asymptotically better. The binary searches will have poor locality of 
reference and deleting rows from a temporary table that has spilled to 
disk may be expensive, so in practice sorting just one input may only be 
better when the two inputs have significantly different sizes. Perhaps 
in the future we can implement both methods and pick the better one at 
optimize time.

I wanted to make an initial implementation that is simple, handles both 
INTERSECT and EXCEPT, and that fits into the current Derby 
implementation fairly easily. If the performance of INTERSECT or EXCEPT 
proves to be important then we should revisit this.

In fact there are a number of  related implementation alternatives that 
may be optimal under different conditions.
1. Just sort the smaller input and use a binary search to see if each 
row in the unsorted input is also in the sorted input. INTERSECT is 
commutative, so it does not matter which of its inputs is sorted. EXCEPT 
is not commutative. If we sort its left input and scan through the 
unsorted right input we have to remove/mark the intersection rows in the 
left (sorted) input and output the unremoved/marked rows at the end.
2. We can also handle duplicate elimination when just sorting one input 
by removing duplicates when doing the sort, but not removing/marking 
rows from the sorted input as they are found in the other input.
3. We can use a hash table instead of sorting. Unfortunately Derby does 
not yet have a hash table implementation that spills to disk.

Jack
Jack Klebanoff

Mime
View raw message