db-derby-dev mailing list archives

From Jack Klebanoff <kleba...@Mutagen.Net>
Subject Re: BackingStoreHashtable
Date Tue, 25 Jan 2005 20:15:53 GMT
The optimizer decides when a join should be implemented as a hash join.
If it estimates that one side of the join is small enough to fit in
memory it may choose a hash join.

The nature of estimates is that sometimes they are wrong. This is true 
for the optimizer's size estimates. Therefore, sometimes the hash table 
created for a hash join does not fit in memory. Currently the result is 
that the JVM terminates with an OutOfMemoryError, or it thrashes.  See 
http://nagoya.apache.org/jira/browse/DERBY-106.

I think that Derby should handle underestimates more gracefully.

When a hash join is executed a BackingStoreHashtable object is 
constructed to implement the join. One of its constructor parameters is 
"max_inmemory_rowcnt", the maximum number of rows to insert into the 
in-memory hash table before overflowing to disk. Despite this 
constructor parameter (and the class name) the current implementation of 
BackingStoreHashtable never spills to disk. Hence Jira 106.

I propose that we change BackingStoreHashtable to spill to disk.

The goals of the change are as follows:

1. Allow hash joins to work on any input size, up to the amount of 
available disk.

2. The execution time for hash joins should be reasonable even if the 
size estimates prove to be low.

3. Have at most a moderate impact on Derby's size.

4. Use a low amount of development effort.


The basic idea is as follows. When the number of rows in a
BackingStoreHashtable exceeds max_inmemory_rowcnt, a disk-based
associative data structure will be created and all new rows will be
inserted into the disk structure. Existing rows will be left in memory.
When BackingStoreHashtable is asked to find rows with a given key it
will first look in the in-memory Hashtable, then in the disk structure.
If duplicate keys are allowed in the BackingStoreHashtable then it must
look in both places.

(There are several reasonable variations on this scheme dealing with the 
movement of rows between the in-memory hash table and the disk 
structure. They will be discussed later. However, I think that the 
simplest scheme is best).
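
To make the intended behaviour concrete, here is a minimal sketch in
Java of the insert and lookup flow. The class and method names
(SpillingHashtableSketch, diskInsert, diskFetch, and so on) are purely
illustrative and do not correspond to the real BackingStoreHashtable
internals; the disk structure itself is left abstract, since the two
candidate implementations for it are discussed below.

    import java.util.Hashtable;
    import java.util.Vector;

    // Illustrative sketch only; names and structure are hypothetical and
    // do not match the actual BackingStoreHashtable code.
    abstract class SpillingHashtableSketch {
        private final Hashtable<Object, Vector<Object[]>> inMemory =
            new Hashtable<Object, Vector<Object[]>>();
        private final long maxInMemoryRowCount; // role of max_inmemory_rowcnt
        private long inMemoryRowCount = 0;
        private boolean spilled = false;

        SpillingHashtableSketch(long maxInMemoryRowCount) {
            this.maxInMemoryRowCount = maxInMemoryRowCount;
        }

        // The disk-based associative structure (Btree or on-disk hash
        // table) is abstracted behind these two methods.
        protected abstract void diskInsert(Object key, Object[] row);
        protected abstract Vector<Object[]> diskFetch(Object key);

        void put(Object key, Object[] row) {
            if (inMemoryRowCount < maxInMemoryRowCount) {
                // Rows already in memory stay there; only the overflow
                // spills to disk.
                Vector<Object[]> rows = inMemory.get(key);
                if (rows == null) {
                    rows = new Vector<Object[]>();
                    inMemory.put(key, rows);
                }
                rows.add(row);
                inMemoryRowCount++;
            } else {
                spilled = true;
                diskInsert(key, row);
            }
        }

        Vector<Object[]> get(Object key, boolean duplicatesAllowed) {
            Vector<Object[]> result = new Vector<Object[]>();
            Vector<Object[]> memRows = inMemory.get(key);
            if (memRows != null) {
                result.addAll(memRows);
                // If duplicates are not kept, a hit in memory is complete
                // and the disk structure need not be consulted.
                if (!duplicatesAllowed) {
                    return result;
                }
            }
            if (spilled) {
                result.addAll(diskFetch(key));
            }
            return result;
        }
    }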

There are two main alternatives for implementing the disk-based
associative data structure. One is to use Btrees; the other is to
implement a dynamic hash table.

Btree Implementation
--------------------

In the Btree implementation we essentially create a temporary table, a 
HeapConglomerate, to hold the overflow rows and create a Btree index on 
the key columns. The only hitch is that our Btree implementation has a 
limit on the total size of a key. Therefore, instead of indexing on the 
key columns we will index on the key hash code.

In order to search for all the rows with a given key we compute the key 
hash code and scan the Btree for the locations of all the rows with that 
hash code. We then fetch the candidate rows from the HeapConglomerate 
and return those with matching keys.
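
In code, the probe might look roughly like the sketch below. The
scanBtreeForHashCode, fetchHeapRow, and keyOf methods are stand-ins for
the real scan controller and conglomerate controller calls, not actual
Derby APIs.

    import java.util.Vector;

    // Rough sketch of the Btree-backed probe: scan the Btree on the
    // 32-bit hash code, then fetch the candidate rows from the heap
    // conglomerate and keep only the true key matches. All method names
    // below are placeholders for the real store interfaces.
    abstract class BtreeOverflowProbeSketch {
        protected abstract long[] scanBtreeForHashCode(int hashCode);
        protected abstract Object[] fetchHeapRow(long rowLocation);
        protected abstract Object keyOf(Object[] row);

        Vector<Object[]> getOverflowRows(Object key) {
            Vector<Object[]> matches = new Vector<Object[]>();
            int hashCode = key.hashCode();   // the "big" hash code
            long[] locations = scanBtreeForHashCode(hashCode);
            for (int i = 0; i < locations.length; i++) {
                Object[] candidate = fetchHeapRow(locations[i]);
                // Different keys can share a hash code, so re-check the
                // key itself before returning the row.
                if (keyOf(candidate).equals(key)) {
                    matches.add(candidate);
                }
            }
            return matches;
        }
    }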

This implementation can use the existing heap and Btree conglomerate 
implementations, so it should not increase the Derby footprint by much, 
nor should it take much development time.

Hash Table Implementation
-------------------------

In the hash table implementation we write a new conglomerate type that 
implements a dynamic hash table on disk. I would use the dynamic hash 
algorithm of P.-A. Larson, "Dynamic Hash Tables", Communications of the 
ACM, 31(1988). With this algorithm a hash table of N rows is built in 
time O(N*log(N)) and only one disk access is needed to find a row if all 
the rows of its hash bucket fit on one page. Furthermore it grows the 
hash table by one bucket at a time, as needed, so it does not waste very 
much space.

The hash conglomerate would be implemented using two containers: one for 
the first page of each bucket and the second for overflow pages. Two 
containers are necessary so that we can use the hash code as a page 
number to find the first page of a bucket. We cannot put overflow pages 
in the primary container because this mapping would break if we had to 
increase the number of buckets after creating an overflow page.

The overflow container is used when a bucket has too many rows to fit on 
one page and when a row is too large to fit comfortably in a page. A 
long row is represented as a (big) hash code and a pointer to the 
overflow page holding the long row. Access to a long row will always 
take at least two disk accesses, unless one of the pages is in the cache.

Note that there are two related hash functions in use, a big one and a
small one. The big hash function produces a 32-bit hash code, the one
used by the standard in-memory hash table. The small hash code is the
low-order n bits of the big hash code. The small hash code is used to
index the disk buckets. Under the Larson algorithm there are between
2**(n-1)+1 and 2**n buckets.
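
As an illustration of how the small hash code addresses buckets under
Larson-style linear hashing, here is a sketch of the bucket-address
arithmetic. The class and method are hypothetical; only the address
computation is shown, not the page-level layout.

    // Sketch of the bucket-address computation in Larson-style linear
    // hashing. The "big" hash code is the full 32-bit value; the "small"
    // hash code is its low-order n bits, falling back to n-1 bits for
    // buckets that have not yet been created by a split.
    final class LinearHashAddressSketch {
        static int bucketFor(int bigHashCode, int numBuckets, int n) {
            int smallHash = bigHashCode & ((1 << n) - 1); // low n bits
            if (smallHash >= numBuckets) {
                // Bucket not created yet; its rows still live in the
                // bucket addressed by the low-order n-1 bits.
                smallHash = bigHashCode & ((1 << (n - 1)) - 1);
            }
            return smallHash;
        }
    }

For example, with 6 buckets (n = 3) a big hash code whose low three
bits are 6 or 7 falls back to the low two bits, so its rows are found
in bucket 2 or 3 until those buckets are split.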


Comparison
----------

We are interested in 4 measures:
1. the time required to get all the rows with a given key,
2. the time required to build the data structure,
3. the disk space used by the data structure, and
4. the amount of new code (this impacts both development time and Derby 
footprint).

Access Time
-----------

In order to find rows using the Btree implementation we first traverse 
the Btree to get a list of all rows with the same big hash code. Then we 
access the heap pages referenced by the Btree and discard the rows whose 
keys do not match. The rows will probably not be clustered in the heap, 
so there is likely to be one disk access per reference. Some number of 
Btree nodes will be accessed, though the top nodes are likely to be in 
the page cache. So the number of disk accesses is likely to be about 1 + 
N(K), where N(K) is the number of rows with the same big hash code value 
as key K.

In order to find rows using the disk hash implementation we just 
traverse the pages in the bucket corresponding to key K. The number of 
page accesses is ceil(n(K)/b), where n(K) is the number of rows with the 
same small hash code as key K and b is the number of rows per page.

So the disk hash implementation will access rows more quickly when 
n(K)/N(K) < b. That is, the disk hash implementation does better when 
the row size is small relative to the page size, or when the small hash 
function doesn't give up too much selectivity over the big hash function.
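
To make the comparison concrete with purely illustrative numbers:
suppose b = 20 rows fit on a page, a key K matches N(K) = 1 row under
the big hash code, and the small hash code lumps n(K) = 10 rows into
K's bucket. The Btree path costs about 1 + N(K) = 2 disk accesses while
the hash path costs ceil(10/20) = 1, so the hash structure wins
(10/1 < 20). With much longer rows, say b = 4, the hash path needs
ceil(10/4) = 3 accesses and the Btree path is cheaper.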

Build Time
----------

Both implementations take O(N*log(N)) time to build in the worst case. 
However there are some differences in the build times.

The Btree implementation will split Btree nodes as the structure is 
built up. However, since the Btree is keyed on an integer hash code this 
is not as expensive as shuffling the full rows. The Btree implementation 
does not move the actual rows as the data structure is populated.

If the hash implementation's initial estimate of the number of buckets 
is good or high then it does not shuffle rows around as the data 
structure is populated and the build time is O(N), which is quite good. 
If the initial bucket count estimate is low then the hash implementation 
shuffles rows around as buckets are split. I expect that most rows will
be long enough that this is more expensive than splitting Btree nodes.

So the build time of the hash implementation is better if the initial 
size estimate is good, but it is probably worse than the Btree 
implementation when the size estimate is bad.

Data Structure Size
-------------------

In the Btree implementation all the "waste" space is in the Btree 
conglomerate. Its heap conglomerate is about as compact as it can be. 
Because the Btree keys on the hash code it is smaller than the heap 
conglomerate, unless the rows are quite small.

The hash implementation wastes space when some of the buckets and/or 
overflow pages are not full. This happens because we initially 
overestimate the number of buckets required, because the small hash 
function distributes the rows poorly, and/or because we choose a load 
factor less than one.

So, sometimes the hash implementation uses less disk space for the data 
structure and sometimes the Btree implementation uses less space. I 
suspect that the Btree implementation is better in most cases.

Code Size
---------

The Btree implementation can use the existing Btree and heap 
conglomerate code while the hash implementation must work at a lower 
level to implement a new conglomerate class. So I suspect that the hash 
implementation will require more new code. I don't think that either
implementation requires an enormous amount of new code.


Moving Rows Between Memory and Disk
-----------------------------------

BackingStoreHashtable starts off with a purely RAM based structure and, 
under this proposal, spills to disk when it exceeds a size threshold. I 
have proposed that once a row is placed in the RAM portion of the 
structure it never be moved to disk; only new rows be placed on disk. 
This has the virtue of simplicity. However, there are several reasonable 
options that I will discuss for the sake of completeness.

When the disk structure is created we could move some or all of the rows
to disk. Mike Matrigali suggested moving all the rows to disk to reduce
Derby's RAM usage, reducing the likelihood of thrashing or an
OutOfMemoryError, particularly in a multi-threaded environment. However,
it would also increase the hash table build time and the row fetch time.
Since Derby's BackingStoreHashtables are transient, the data structure
build time is important. It may sometimes be better in terms of memory
usage to build and use the hash table quickly so that it can be deleted
quickly.

Keeping or eliminating duplicates is a BackingStoreHashtable constructor 
option. When there are multiple rows with the same key 
BackingStoreHashtable.get returns a Vector of all the rows. If 
BackingStoreHashtable has started to spill to disk and we insert a row
whose key duplicates the key of one or more rows in RAM, we could move
all of the rows with that key to disk. Then the BackingStoreHashtable.get
method would not have to look on disk when it finds a row in RAM. This
would improve the average get() time, at the cost of slowing down the
BackingStoreHashtable build time.
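
Extending the earlier sketch, this variant could be expressed as an
additional method on the hypothetical SpillingHashtableSketch class; the
payoff is that get() could then skip the disk probe whenever it finds
the key in memory, even when duplicates are allowed.

    // Hypothetical variant: once spilling has started, a new row whose
    // key already has rows in memory triggers migration of those rows to
    // disk, so that no key has rows both in memory and on disk.
    void putWithDuplicateMigration(Object key, Object[] row) {
        Vector<Object[]> memRows = inMemory.get(key);
        if (spilled && memRows != null) {
            for (int i = 0; i < memRows.size(); i++) {
                diskInsert(key, memRows.get(i)); // move existing rows
            }
            inMemory.remove(key);
            // The in-memory row count is deliberately left unchanged, so
            // new rows keep spilling and a migrated key cannot reappear
            // in memory.
            diskInsert(key, row); // the new duplicate goes to disk too
        } else {
            put(key, row);        // normal path from the earlier sketch
        }
    }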


Conclusion
----------

I recommend that we implement spilling with the Btree approach because
it offers a clear advantage over the disk hash implementation in
footprint and development time, while neither implementation has a clear
overall advantage in access time, build time, or data structure size.
The hash implementation does have some advantage in access time, build
time, and space when the rows are of moderate size and our initial size
estimate is good; however, we only spill to disk when our size estimate
is bad.

