The optimizer decides when to implement a join using hash joins. If it
estimates that one side of the join is small enough to fit in memory it
may choose a hash join.
The nature of estimates is that sometimes they are wrong. This is true
for the optimizer's size estimates. Therefore, sometimes the hash table
created for a hash join does not fit in memory. Currently the result is
that the JVM terminates with an OutOfMemoryError, or it thrashes. See
http://nagoya.apache.org/jira/browse/DERBY106.
I think that Derby should handle underestimates more gracefully.
When a hash join is executed a BackingStoreHashtable object is
constructed to implement the join. One of its constructor parameters is
"max_inmemory_rowcnt", the maximum number of rows to insert into the
inmemory hash table before overflowing to disk. Despite this
constructor parameter (and the class name) the current implementation of
BackingStoreHashtable never spills to disk. Hence Jira 106.
I propose that we change BackingStoreHashtable to spill to disk.
The goals of the change are as follows:
1. Allow hash joins to work on any input size, up to the amount of
available disk.
2. The execution time for hash joins should be reasonable even if the
size estimates prove to be low.
3. Have at most a moderate impact on Derby's size.
4. Use a low amount of development effort.
The basic idea is as follows. When the number of rows in a
BackingStoreHashtable exceeds max_inmemory_rowcnt a disk based
associative data structure will be created and all new rows will be
inserted into the disk structure. Existing rows will be left in memory.
When BackingStoreHashtable is asked to find rows with a given key it
will first look in the in memory Hashtable, then in the disk structure.
If duplicate keys are allowed in the BackingStoreHashtable then it must
look in both places.
(There are several reasonable variations on this scheme dealing with the
movement of rows between the inmemory hash table and the disk
structure. They will be discussed later. However, I think that the
simplest scheme is best).
There are two main alternatives for implementing the disk based
associative data structure. One is to use Btrees, the other is to
implement a dynamic hash table.
Btree Implementation

In the Btree implementation we essentially create a temporary table, a
HeapConglomerate, to hold the overflow rows and create a Btree index on
the key columns. The only hitch is that our Btree implementation has a
limit on the total size of a key. Therefore, instead of indexing on the
key columns we will index on the key hash code.
In order to search for all the rows with a given key we compute the key
hash code and scan the Btree for the locations of all the rows with that
hash code. We then fetch the candidate rows from the HeapConglomerate
and return those with matching keys.
This implementation can use the existing heap and Btree conglomerate
implementations, so it should not increase the Derby footprint by much,
nor should it take much development time.
Hash Table Implementation

In the hash table implementation we write a new conglomerate type that
implements a dynamic hash table on disk. I would use the dynamic hash
algorithm of P.A. Larson, "Dynamic Hash Tables", Communications of the
ACM, 31(1988). With this algorithm a hash table of N rows is built in
time O(N*log(N)) and only one disk access is needed to find a row if all
the rows of its hash bucket fit on one page. Furthermore it grows the
hash table by one bucket at a time, as needed, so it does not waste very
much space.
The hash conglomerate would be implemented using two containers: one for
the first page of each bucket and the second for overflow pages. Two
containers are necessary so that we can use the hash code as a page
number to find the first page of a bucket. We cannot put overflow pages
in the primary container because this mapping would break if we had to
increase the number of buckets after creating an overflow page.
The overflow container is used when a bucket has too many rows to fit on
one page and when a row is too large to fit comfortably in a page. A
long row is represented as a (big) hash code and a pointer to the
overflow page holding the long row. Access to a long row will always
take at least two disk accesses, unless one of the pages is in the cache.
Note that there are two related hash functions in use, a big one and a
small one. The big hash function produces a 32 bit hash code, the one
used by the standard in memory hash table. The small hash code is the
low order n bits of the big hash code. The small hash code is used to
index the disk buckets. Under the Larson algorithm there are between
2**(n1)+1 and 2**n buckets.
Comparison

We are interested in 4 measures:
1. the time required to get all the rows with a given key,
2. the time required to build the data structure,
3. the disk space used by the data structure, and
4. the amount of new code (this impacts both development time and Derby
footprint).
Access Time

In order to find rows using the Btree implementation we first traverse
the Btree to get a list of all rows with the same big hash code. Then we
access the heap pages referenced by the Btree and discard the rows whose
keys do not match. The rows will probably not be clustered in the heap,
so there is likely to be one disk access per reference. Some number of
Btree nodes will be accessed, though the top nodes are likely to be in
the page cache. So the number of disk accesses is likely to be about 1 +
N(K), where N(K) is the number of rows with the same big hash code value
as key K.
In order to find rows using the disk hash implementation we just
traverse the pages in the bucket corresponding to key K. The number of
page accesses is ceil(n(K)/b), where n(K) is the number of rows with the
same small hash code as key K and b is the number of rows per page.
So the disk hash implementation will access rows more quickly when
n(K)/N(K) < b. That is, the disk hash implementation does better when
the row size is small relative to the page size, or when the small hash
function doesn't give up too much selectivity over the big hash function.
Build Time

Both implementations take O(N*log(N)) time to build in the worst case.
However there are some differences in the build times.
The Btree implementation will split Btree nodes as the structure is
built up. However, since the Btree is keyed on an integer hash code this
is not as expensive as shuffling the full rows. The Btree implementation
does not move the actual rows as the data structure is populated.
If the hash implementation's initial estimate of the number of buckets
is good or high then it does not shuffle rows around as the data
structure is populated and the build time is O(N), which is quite good.
If the initial bucket count estimate is low then the hash implementation
shuffles rows around as buckets are split. I expect that most rows are
be long enough so that this is more expensive than splitting Btree nodes.
So the build time of the hash implementation is better if the initial
size estimate is good, but it is probably worse than the Btree
implementation when the size estimate is bad.
Data Structure Size

In the Btree implementation all the "waste" space is in the Btree
conglomerate. Its heap conglomerate is about as compact as it can be.
Because the Btree keys on the hash code it is smaller than the heap
conglomerate, unless the rows are quite small.
The hash implementation wastes space when some of the buckets and/or
overflow pages are not full. This happens because we initially
overestimate the number of buckets required, because the small hash
function distributes the rows poorly, and/or because we choose a load
factor less than one.
So, sometimes the hash implementation uses less disk space for the data
structure and sometimes the Btree implementation uses less space. I
suspect that the Btree implementation is better in most cases.
Code Size

The Btree implementation can use the existing Btree and heap
conglomerate code while the hash implementation must work at a lower
level to implement a new conglomerate class. So I suspect that the hash
implementation will require more new code. I don't think that it either
implementation requires an enormous amount of new code.
Moving Rows Between Memory and Disk

BackingStoreHashtable starts off with a purely RAM based structure and,
under this proposal, spills to disk when it exceeds a size threshold. I
have proposed that once a row is placed in the RAM portion of the
structure it never be moved to disk; only new rows be placed on disk.
This has the virtue of simplicity. However, there are several reasonable
options that I will discuss for the sake of completeness.
When the disk structure is created we could move some or all of the rows
to disk. Mike Matrigali suggested moving all the rows to disk to reduce
Derby's RAM usage, reducing the likelihood of thrashing or an
OutOfMemoryError particularly in a multithread environment. However it
would also increase the hash table build time and the row fetch time.
Since Derby's BackingStoreHashtables are transient the data structure
build time is important. It sometimes be better in terms of memory usage
to build and use the hash table quickly so that it can be deleted quickly.
Keeping or eliminating duplicates is a BackingStoreHashtable constructor
option. When there are multiple rows with the same key
BackingStoreHashtable.get returns a Vector of all the rows. If
BackingStoreHashtable has started to spill to disk and we insert a row
with a key that duplicates the key of one or more rows in RAM we could
move all of the rows to disk. Then the BackingStoreHashtable.get method
would not have to look on disk when it finds a row in RAM. This would
improve the average get() time, at the cost of slowing down the
BackingStoreHashtable build time.
Conclusion

I recommend that we implement spilling using the Btree implementation
because it offers a clear advantage over the disk hash implementation in
footprint and development time while there is no clear advantage for
either implementation in terms of access time, build time, or data
structure size. The hash implementation has some advantage in access
time, build time, and space when the rows have a moderate size and our
initial size estimate is good. However we only spill to disk when our
size estimate is bad.
