db-derby-dev mailing list archives

From "RPost" <rp0...@pacbell.net>
Subject Re: BackingStoreHashtable
Date Tue, 25 Jan 2005 23:19:52 GMT
Not disagreeing with your conclusion per se, but one question.

Does either 'spill to disk' approach have any other possible future use?
Perhaps for implementing other features or functionality on anyone's
'nice to have' list? Or would either the btree or the hash implementation be
useful only for this one purpose?


----- Original Message ----- 
From: "Jack Klebanoff" <klebanof@Mutagen.Net>
To: "Derby Development" <derby-dev@db.apache.org>
Sent: Tuesday, January 25, 2005 12:15 PM
Subject: Re: BackingStoreHashtable


> The optimizer decides when to implement a join using hash joins. If it
> estimates that one side of the join is small enough to fit in memory it
> may choose a hash join.
>
> The nature of estimates is that sometimes they are wrong. This is true
> for the optimizer's size estimates. Therefore, sometimes the hash table
> created for a hash join does not fit in memory. Currently the result is
> that the JVM terminates with an OutOfMemoryError, or it thrashes.  See
> http://nagoya.apache.org/jira/browse/DERBY-106.
>
> I think that Derby should handle underestimates more gracefully.
>
> When a hash join is executed a BackingStoreHashtable object is
> constructed to implement the join. One of its constructor parameters is
> "max_inmemory_rowcnt", the maximum number of rows to insert into the
> in-memory hash table before overflowing to disk. Despite this
> constructor parameter (and the class name), the current implementation of
> BackingStoreHashtable never spills to disk. Hence Jira 106.
>
> I propose that we change BackingStoreHashtable to spill to disk.
>
> The goals of the change are as follows:
>
> 1. Allow hash joins to work on any input size, up to the amount of
> available disk.
>
> 2. The execution time for hash joins should be reasonable even if the
> size estimates prove to be low.
>
> 3. Have at most a moderate impact on Derby's size.
>
> 4. Require only a modest amount of development effort.
>
>
> The basic idea is as follows. When the number of rows in a
> BackingStoreHashtable exceeds max_inmemory_rowcnt a disk based
> associative data structure will be created and all new rows will be
> inserted into the disk structure. Existing rows will be left in memory.
> When BackingStoreHashtable is asked to find rows with a given key it
> will first look in the in-memory Hashtable, then in the disk structure.
> If duplicate keys are allowed in the BackingStoreHashtable then it must
> look in both places.
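>
> As a rough sketch of that insert/lookup logic (illustrative Java only; the
> DiskOverflow type, its methods, and the factory below are hypothetical
> stand-ins, not an existing Derby interface):
>
>     import java.util.Hashtable;
>     import java.util.Vector;
>     import java.util.function.Supplier;
>
>     interface DiskOverflow {                       // hypothetical disk-based structure
>         void insert(Object key, Object[] row);
>         Vector<Object[]> fetch(Object key);        // empty Vector when nothing matches
>     }
>
>     class SpillingHashtable {
>         private final long maxInMemoryRowCount;    // spill threshold
>         private final Supplier<DiskOverflow> diskFactory;
>         private final Hashtable<Object, Vector<Object[]>> inMemory = new Hashtable<>();
>         private DiskOverflow disk;                 // created lazily on the first spill
>         private long inMemoryRowCount = 0;
>
>         SpillingHashtable(long maxInMemoryRowCount, Supplier<DiskOverflow> diskFactory) {
>             this.maxInMemoryRowCount = maxInMemoryRowCount;
>             this.diskFactory = diskFactory;
>         }
>
>         void put(Object key, Object[] row) {
>             if (disk == null && inMemoryRowCount >= maxInMemoryRowCount) {
>                 disk = diskFactory.get();          // threshold exceeded: start spilling
>             }
>             if (disk == null) {
>                 inMemory.computeIfAbsent(key, k -> new Vector<>()).add(row);
>                 inMemoryRowCount++;                // rows already in memory stay there
>             } else {
>                 disk.insert(key, row);             // all new rows go to the disk structure
>             }
>         }
>
>         Vector<Object[]> get(Object key, boolean duplicatesAllowed) {
>             Vector<Object[]> result = new Vector<>();
>             Vector<Object[]> ramRows = inMemory.get(key);
>             if (ramRows != null) {
>                 result.addAll(ramRows);
>                 if (!duplicatesAllowed) {
>                     return result;                 // a unique key lives in only one place
>                 }
>             }
>             if (disk != null) {
>                 result.addAll(disk.fetch(key));    // duplicates may live in both places
>             }
>             return result;
>         }
>     }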
>
> (There are several reasonable variations on this scheme dealing with the
> movement of rows between the in-memory hash table and the disk
> structure. They will be discussed later. However, I think that the
> simplest scheme is best).
>
> There are two main alternatives for implementing the disk based
> associative data structure. One is to use Btrees, the other is to
> implement a dynamic hash table.
>
> Btree Implementation
> --------------------
>
> In the Btree implementation we essentially create a temporary table, a
> HeapConglomerate, to hold the overflow rows and create a Btree index on
> the key columns. The only hitch is that our Btree implementation has a
> limit on the total size of a key. Therefore, instead of indexing on the
> key columns we will index on the key hash code.
>
> In order to search for all the rows with a given key we compute the key
> hash code and scan the Btree for the locations of all the rows with that
> hash code. We then fetch the candidate rows from the HeapConglomerate
> and return those with matching keys.
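>
> In rough Java (the BtreeIndex, HeapTable, and RowLocation types here are
> hypothetical stand-ins for the store interfaces, and the key comparison
> assumes the key columns lead the row):
>
>     import java.util.Vector;
>
>     interface RowLocation {}
>     interface BtreeIndex { Iterable<RowLocation> scanEqual(int hashCode); }
>     interface HeapTable  { Object[] fetch(RowLocation loc); }
>
>     class HashJoinProbe {
>         static Vector<Object[]> fetchRows(Object[] key, BtreeIndex hashIndex, HeapTable heap) {
>             int bigHash = java.util.Arrays.hashCode(key);  // stand-in for the key hash code
>             Vector<Object[]> matches = new Vector<>();
>             // Scan the Btree for every row location stored under this hash code,
>             // fetch the candidates from the heap, and discard hash collisions.
>             for (RowLocation loc : hashIndex.scanEqual(bigHash)) {
>                 Object[] candidate = heap.fetch(loc);
>                 if (keysMatch(candidate, key)) {
>                     matches.add(candidate);
>                 }
>             }
>             return matches;
>         }
>
>         private static boolean keysMatch(Object[] candidate, Object[] key) {
>             for (int i = 0; i < key.length; i++) {
>                 if (!key[i].equals(candidate[i])) {
>                     return false;
>                 }
>             }
>             return true;
>         }
>     }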
>
> This implementation can use the existing heap and Btree conglomerate
> implementations, so it should not increase the Derby footprint by much,
> nor should it take much development time.
>
> Hash Table Implementation
> -------------------------
>
> In the hash table implementation we write a new conglomerate type that
> implements a dynamic hash table on disk. I would use the dynamic hash
> algorithm of P.-A. Larson, "Dynamic Hash Tables", Communications of the
> ACM, 31(1988). With this algorithm a hash table of N rows is built in
> time O(N*log(N)) and only one disk access is needed to find a row if all
> the rows of its hash bucket fit on one page. Furthermore it grows the
> hash table by one bucket at a time, as needed, so it does not waste very
> much space.
>
> The hash conglomerate would be implemented using two containers: one for
> the first page of each bucket and the other for overflow pages. Two
> containers are necessary so that we can use the hash code as a page
> number to find the first page of a bucket. We cannot put overflow pages
> in the primary container because this mapping would break if we had to
> increase the number of buckets after creating an overflow page.
>
> The overflow container is used when a bucket has too many rows to fit on
> one page and when a row is too large to fit comfortably in a page. A
> long row is represented as a (big) hash code and a pointer to the
> overflow page holding the long row. Access to a long row will always
> take at least two disk accesses, unless one of the pages is in the cache.
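>
> In other words (a purely hypothetical layout sketch, not Derby's container
> API):
>
>     final class DiskHashLayout {
>         // Primary container: a bucket's first page is found directly from the
>         // bucket number (the small hash code), so no directory page is needed.
>         static long primaryPageFor(int bucketNumber) {
>             return bucketNumber;   // a fixed header-page offset would work the same way
>         }
>
>         // Overflow container: holds bucket overflow pages and long rows. A row
>         // too large for its bucket page leaves only this small stub behind.
>         static final class LongRowStub {
>             final int bigHashCode;          // full 32-bit hash, used to filter candidates
>             final long overflowPageNumber;  // where the actual row lives
>             LongRowStub(int bigHashCode, long overflowPageNumber) {
>                 this.bigHashCode = bigHashCode;
>                 this.overflowPageNumber = overflowPageNumber;
>             }
>         }
>     }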
>
> Note that there are two related hash functions in use, a big one and a
> small one. The big hash function produces a 32-bit hash code, the one
> used by the standard in-memory hash table. The small hash code is the
> low order n bits of the big hash code. The small hash code is used to
> index the disk buckets. Under the Larson algorithm there are between
> 2**(n-1)+1 and 2**n buckets.
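>
> A minimal in-memory model of the addressing and growth (illustrative Java
> only; the real structure would work on disk pages, and the load target is
> an arbitrary placeholder):
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     class LinearHashSketch {
>         private int n = 1;                          // small hash uses the low n bits
>         private final List<List<Object[]>> buckets = new ArrayList<>();
>         private long rowCount = 0;
>         private static final int TARGET_ROWS_PER_BUCKET = 64;   // illustrative
>
>         LinearHashSketch() {
>             buckets.add(new ArrayList<>());         // start with 2 buckets and n = 1,
>             buckets.add(new ArrayList<>());         // so 2**(n-1)+1 <= buckets <= 2**n holds
>         }
>
>         // Small hash: the low n bits of the big hash. If that bucket has not
>         // been created yet, fall back to the low n-1 bits.
>         int bucketFor(int bigHash) {
>             int b = bigHash & ((1 << n) - 1);
>             if (b >= buckets.size()) {
>                 b &= (1 << (n - 1)) - 1;
>             }
>             return b;
>         }
>
>         void insert(int bigHash, Object[] row) {
>             buckets.get(bucketFor(bigHash)).add(row);
>             rowCount++;
>             if (rowCount > (long) TARGET_ROWS_PER_BUCKET * buckets.size()) {
>                 addOneBucket();                     // grow by exactly one bucket
>             }
>         }
>
>         // Adding bucket newIdx only re-addresses the rows of bucket
>         // newIdx - 2**(n-1); every other bucket is left untouched.
>         private void addOneBucket() {
>             int newIdx = buckets.size();
>             buckets.add(new ArrayList<>());
>             if (buckets.size() > (1 << n)) {
>                 n++;                                // crossed a power of two
>             }
>             int splitIdx = newIdx - (1 << (n - 1));
>             List<Object[]> old = buckets.get(splitIdx);
>             List<Object[]> keep = new ArrayList<>();
>             for (Object[] row : old) {
>                 int b = bucketFor(bigHashOf(row));  // either splitIdx or newIdx
>                 (b == splitIdx ? keep : buckets.get(b)).add(row);
>             }
>             buckets.set(splitIdx, keep);
>         }
>
>         private static int bigHashOf(Object[] row) {
>             return java.util.Arrays.hashCode(row);  // stand-in for the real key hash
>         }
>     }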
>
>
> Comparison
> ----------
>
> We are interested in 4 measures:
> 1. the time required to get all the rows with a given key,
> 2. the time required to build the data structure,
> 3. the disk space used by the data structure, and
> 4. the amount of new code (this impacts both development time and Derby
> footprint).
>
> Access Time
> -----------
>
> In order to find rows using the Btree implementation we first traverse
> the Btree to get a list of all rows with the same big hash code. Then we
> access the heap pages referenced by the Btree and discard the rows whose
> keys do not match. The rows will probably not be clustered in the heap,
> so there is likely to be one disk access per reference. Some number of
> Btree nodes will be accessed, though the top nodes are likely to be in
> the page cache. So the number of disk accesses is likely to be about 1 +
> N(K), where N(K) is the number of rows with the same big hash code value
> as key K.
>
> In order to find rows using the disk hash implementation we just
> traverse the pages in the bucket corresponding to key K. The number of
> page accesses is ceil(n(K)/b), where n(K) is the number of rows with the
> same small hash code as key K and b is the number of rows per page.
>
> So the disk hash implementation will access rows more quickly when
> n(K)/N(K) < b. That is, the disk hash implementation does better when
> the row size is small relative to the page size, or when the small hash
> function doesn't give up too much selectivity over the big hash function.
>
> Build Time
> ----------
>
> Both implementations take O(N*log(N)) time to build in the worst case.
> However there are some differences in the build times.
>
> The Btree implementation will split Btree nodes as the structure is
> built up. However, since the Btree is keyed on an integer hash code this
> is not as expensive as shuffling the full rows. The Btree implementation
> does not move the actual rows as the data structure is populated.
>
> If the hash implementation's initial estimate of the number of buckets
> is good or high then it does not shuffle rows around as the data
> structure is populated and the build time is O(N), which is quite good.
> If the initial bucket count estimate is low then the hash implementation
> shuffles rows around as buckets are split. I expect that most rows will
> be long enough that this is more expensive than splitting Btree nodes.
>
> So the build time of the hash implementation is better if the initial
> size estimate is good, but it is probably worse than the Btree
> implementation when the size estimate is bad.
>
> Data Structure Size
> -------------------
>
> In the Btree implementation all the "waste" space is in the Btree
> conglomerate. Its heap conglomerate is about as compact as it can be.
> Because the Btree keys on the hash code it is smaller than the heap
> conglomerate, unless the rows are quite small.
>
> The hash implementation wastes space when some of the buckets and/or
> overflow pages are not full. This happens because we initially
> overestimate the number of buckets required, because the small hash
> function distributes the rows poorly, and/or because we choose a load
> factor less than one.
>
> So, sometimes the hash implementation uses less disk space for the data
> structure and sometimes the Btree implementation uses less space. I
> suspect that the Btree implementation is better in most cases.
>
> Code Size
> ---------
>
> The Btree implementation can use the existing Btree and heap
> conglomerate code while the hash implementation must work at a lower
> level to implement a new conglomerate class. So I suspect that the hash
> implementation will require more new code. I don't think that either
> implementation requires an enormous amount of new code.
>
>
> Moving Rows Between Memory and Disk
> -----------------------------------
>
> BackingStoreHashtable starts off with a purely RAM based structure and,
> under this proposal, spills to disk when it exceeds a size threshold. I
> have proposed that once a row is placed in the RAM portion of the
> structure it never be moved to disk; only new rows be placed on disk.
> This has the virtue of simplicity. However, there are several reasonable
> options that I will discuss for the sake of completeness.
>
> When the disk structure is created we could move some or all of the rows
> to disk. Mike Matrigali suggested moving all the rows to disk to reduce
> Derby's RAM usage, reducing the likelihood of thrashing or an
> OutOfMemoryError, particularly in a multi-threaded environment. However, it
> would also increase the hash table build time and the row fetch time.
> Since Derby's BackingStoreHashtables are transient the data structure
> build time is important. It may sometimes be better in terms of memory usage
> to build and use the hash table quickly so that it can be deleted quickly.
>
> Keeping or eliminating duplicates is a BackingStoreHashtable constructor
> option. When there are multiple rows with the same key
> BackingStoreHashtable.get returns a Vector of all the rows. If
> BackingStoreHashtable has started to spill to disk and we insert a row
> with a key that duplicates the key of one or more rows in RAM we could
> move all of the rows to disk. Then the BackingStoreHashtable.get method
> would not have to look on disk when it finds a row in RAM. This would
> improve the average get() time, at the cost of slowing down the
> BackingStoreHashtable build time.
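>
> Reusing the hypothetical inMemory/disk fields from the earlier spill sketch,
> that variation would look roughly like this:
>
>     // When a spilled insert duplicates an in-memory key, migrate the RAM rows
>     // for that key to disk as well, so get() never has to merge both places.
>     void putAfterSpill(Object key, Object[] row) {
>         Vector<Object[]> ramRows = inMemory.remove(key);   // null if key not in RAM
>         if (ramRows != null) {
>             for (Object[] ramRow : ramRows) {
>                 disk.insert(key, ramRow);
>             }
>         }
>         disk.insert(key, row);
>     }
>
>     // get() can now stop as soon as it finds the key in RAM.
>     Vector<Object[]> get(Object key) {
>         Vector<Object[]> ramRows = inMemory.get(key);
>         if (ramRows != null) {
>             return ramRows;                // no rows for this key remain on disk
>         }
>         return disk != null ? disk.fetch(key) : new Vector<>();
>     }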
>
>
> Conclusion
> ----------
>
> I recommend that we implement spilling using the Btree implementation
> because it offers a clear advantage over the disk hash implementation in
> footprint and development time while there is no clear advantage for
> either implementation in terms of access time, build time, or data
> structure size. The hash implementation has some advantage in access
> time, build time, and space when the rows have a moderate size and our
> initial size estimate is good. However, we only spill to disk when our
> size estimate is bad.
>

