db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: [PATCH] Intersect and Except
Date Fri, 10 Dec 2004 18:44:13 GMT
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I have always thought one hole in the derby query processing strategy
is that it does not support any sort of sort/merge query node.  Could
your new node be used for this?

Does your code take advantage of the case when the rows are already
sorted?  I believe the following case is very common:

TABLE A with PRIMARY KEY A_KEY
TABLE B with PRIMARY KEY B_KEY

Then a join, intersect or except is done on the 2 tables on the 2
primary keys.  In this case it is very easy for the store to return
an ordered stream of rows from each of the tables.  In fact if the
either side is known to be unique I believe that there would be no
need to ever have more than one row in memory at a time from either
side.

I believe the duplicate key case is harder, one would either need
to keep all the duplicate key cases in memory, or the store could
provide a "rewind" mechanism to quickly reset the stream of rows back
to the beginning of the duplicate chain.

I don't know very much about intersect or except but for a simple
join on 2 unique keys which already have indexes on them it seems
obvious to me that a sort merge has to perform better than a hash
join and would the difference increases as the size of result set
on either size increases.

Jack Klebanoff wrote:
| RPost wrote:
|
|> Your text says 'If ALL is specified then duplicates are not returned'.
|> You
|> also say that by default duplicates are removed. So is the default
|> ALL? Or
|> did you mean that if ALL is specified then duplicates are not removed?
|>
|> If you don't need to remove duplicates then doesn't that mean that you
|> only
|> need to sort one of the data sets? If so, this would improve
|> performance. If
|> statistics are available then the smaller dataset should be the one
|> that is
|> sorted.
|>
| I am sorry, my typo. My message should have read "If ALL is specified
| then duplicates *are* returned". The default is to remove duplicates,
| ALL is not the default.
|
| Sorting both inputs makes the final selection easier and faster even
| when duplicates are not removed. As RPost suggests, we could implement
| INTERSECT ALL and EXCEPT ALL by just sorting one of the inputs, avoiding
| part of the sort cost. Then we would scan through each row in the
| unsorted input and use a binary search to see if it is in the sorted
| input. In order to output the correct number of duplicates we must mark
| the row as removed from the sorted input. I don't think that the
| org.apache.derby.iapi.sql.ResultSet interface or the SortedResultSet
| class currently implement row deletion or marking.
|
| If the number of rows in the two inputs are m and n the cost of sorting
| both is O(n*log(n) + m*log(m)). The final select step has a cost of
| O(n+m). If we only sort one input, say the second one, then the sort
| cost is O(n*log(n)) and the cost of the final select step is
| O(m*log(n)), for a total cost O((m+n)*log(n)). If n < m this is
| asymptotically better. The binary searches will have poor locality of
| reference and deleting rows from a temporary table that has spilled to
| disk may be expensive, so in practice sorting just one input may only be
| better when the two inputs have significantly different sizes. Perhaps
| in the future we can implement both methods and pick the better one at
| optimize time.
|
| I wanted to make an initial implementation that is simple, handles both
| INTERSECT and EXCEPT, and that fits into the current Derby
| implementation fairly easily. If the performance of INTERSECT or EXCEPT
| proves to be important then we should revisit this.
|
| In fact there are a number of  related implementation alternatives that
| may be optimal under different conditions.
| 1. Just sort the smaller input and use a binary search to see if each
| row in the unsorted input is also in the sorted input. INTERSECT is
| commutative, so it does not matter which of its inputs is sorted. EXCEPT
| is not commutative. If we sort its left input and scan through the
| unsorted right input we have to remove/mark the intersection rows in the
| left (sorted) input and output the unremoved/marked rows at the end.
| 2. We can also handle duplicate elimination when just sorting one input
| by removing duplicates when doing the sort, but not removing/marking
| rows from the sorted input as they are found in the other input.
| 3. We can use a hash table instead of sorting. Unfortunately Derby does
| not yet have a hash table implementation that spills to disk.
|
| Jack
| Jack Klebanoff
|
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBue58EpeslyHqPs0RAoQZAKDHt4+qtnBU90rbKM+BgFcKTm+s2QCfSxfz
b/ooqn8eCrOiM0cKcivk1bU=
=+WdC
-----END PGP SIGNATURE-----

Mime
View raw message