BEGIN PGP SIGNED MESSAGE
Hash: SHA1
I have always thought one hole in the derby query processing strategy
is that it does not support any sort of sort/merge query node. Could
your new node be used for this?
Does your code take advantage of the case when the rows are already
sorted? I believe the following case is very common:
TABLE A with PRIMARY KEY A_KEY
TABLE B with PRIMARY KEY B_KEY
Then a join, intersect or except is done on the 2 tables on the 2
primary keys. In this case it is very easy for the store to return
an ordered stream of rows from each of the tables. In fact if the
either side is known to be unique I believe that there would be no
need to ever have more than one row in memory at a time from either
side.
I believe the duplicate key case is harder, one would either need
to keep all the duplicate key cases in memory, or the store could
provide a "rewind" mechanism to quickly reset the stream of rows back
to the beginning of the duplicate chain.
I don't know very much about intersect or except but for a simple
join on 2 unique keys which already have indexes on them it seems
obvious to me that a sort merge has to perform better than a hash
join and would the difference increases as the size of result set
on either size increases.
Jack Klebanoff wrote:
 RPost wrote:

> Your text says 'If ALL is specified then duplicates are not returned'.
> You
> also say that by default duplicates are removed. So is the default
> ALL? Or
> did you mean that if ALL is specified then duplicates are not removed?
>
> If you don't need to remove duplicates then doesn't that mean that you
> only
> need to sort one of the data sets? If so, this would improve
> performance. If
> statistics are available then the smaller dataset should be the one
> that is
> sorted.
>
 I am sorry, my typo. My message should have read "If ALL is specified
 then duplicates *are* returned". The default is to remove duplicates,
 ALL is not the default.

 Sorting both inputs makes the final selection easier and faster even
 when duplicates are not removed. As RPost suggests, we could implement
 INTERSECT ALL and EXCEPT ALL by just sorting one of the inputs, avoiding
 part of the sort cost. Then we would scan through each row in the
 unsorted input and use a binary search to see if it is in the sorted
 input. In order to output the correct number of duplicates we must mark
 the row as removed from the sorted input. I don't think that the
 org.apache.derby.iapi.sql.ResultSet interface or the SortedResultSet
 class currently implement row deletion or marking.

 If the number of rows in the two inputs are m and n the cost of sorting
 both is O(n*log(n) + m*log(m)). The final select step has a cost of
 O(n+m). If we only sort one input, say the second one, then the sort
 cost is O(n*log(n)) and the cost of the final select step is
 O(m*log(n)), for a total cost O((m+n)*log(n)). If n < m this is
 asymptotically better. The binary searches will have poor locality of
 reference and deleting rows from a temporary table that has spilled to
 disk may be expensive, so in practice sorting just one input may only be
 better when the two inputs have significantly different sizes. Perhaps
 in the future we can implement both methods and pick the better one at
 optimize time.

 I wanted to make an initial implementation that is simple, handles both
 INTERSECT and EXCEPT, and that fits into the current Derby
 implementation fairly easily. If the performance of INTERSECT or EXCEPT
 proves to be important then we should revisit this.

 In fact there are a number of related implementation alternatives that
 may be optimal under different conditions.
 1. Just sort the smaller input and use a binary search to see if each
 row in the unsorted input is also in the sorted input. INTERSECT is
 commutative, so it does not matter which of its inputs is sorted. EXCEPT
 is not commutative. If we sort its left input and scan through the
 unsorted right input we have to remove/mark the intersection rows in the
 left (sorted) input and output the unremoved/marked rows at the end.
 2. We can also handle duplicate elimination when just sorting one input
 by removing duplicates when doing the sort, but not removing/marking
 rows from the sorted input as they are found in the other input.
 3. We can use a hash table instead of sorting. Unfortunately Derby does
 not yet have a hash table implementation that spills to disk.

 Jack
 Jack Klebanoff

BEGIN PGP SIGNATURE
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Thunderbird  http://enigmail.mozdev.org
iD8DBQFBue58EpeslyHqPs0RAoQZAKDHt4+qtnBU90rbKM+BgFcKTm+s2QCfSxfz
b/ooqn8eCrOiM0cKcivk1bU=
=+WdC
END PGP SIGNATURE
