db-derby-dev mailing list archives

From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: BackingStoreHashtable
Date Fri, 10 Dec 2004 19:09:54 GMT

Some more history: originally, when the overflow portion of
BackingStoreHashTable was considered more of an error path, one suggested
implementation was simply to use the existing Derby temporary btree table
implementation for the overflow portion of the class. The key would be the
hash key, and the data the rest of the row of the result set.

The benefits are that it would be easy to implement, would be a lot less
code to maintain, and would reuse existing technology. The btree already
has full support for bounded access to duplicates.

Access performance would likely be worse than a hash table that reaches a
given key with a single I/O, because searches would have to traverse the
in-memory cached index nodes.
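
For illustration only, here is a rough sketch of the shape that overflow
path could take, with java.util.TreeMap standing in for the temporary
btree conglomerate. The class and method names are made up, and real code
would go through Derby's store interfaces rather than java.util, but it
shows why duplicates come back as one bounded range:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// TreeMap plays the role of the temporary btree: the key is the row's
// hash key, the value holds the spilled rows, so all duplicates for one
// key are stored together and can be fetched as a bounded group.
class BtreeOverflowSketch {
    private final TreeMap<Integer, List<Object[]>> overflow = new TreeMap<>();

    // Spill one row out of the in-memory hash table.
    void spill(Object key, Object[] row) {
        overflow.computeIfAbsent(key.hashCode(), k -> new ArrayList<>()).add(row);
    }

    // Probe during the join: one ordered lookup returns every spilled row
    // sharing the probe key's hash code; the caller still re-checks the
    // actual key columns to weed out hash collisions.
    List<Object[]> probe(Object key) {
        List<Object[]> rows = overflow.get(key.hashCode());
        return rows == null ? new ArrayList<Object[]>() : rows;
    }
}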

Does anyone know whether other DBMSs provide a hash index for SQL indexes?
I know CA-Ingres did when I looked at it many years ago - but it was
static hashing, so the table had to be modified if too much overflow
happened.

Jack Klebanoff wrote:
| Derby uses class BackingStoreHashtable to implement a hash table used in
| hash joins. Despite the class's name it does not spill to disk; the hash
| table is always in memory.
|
| The optimizer tries to avoid hash joins when the input is large because
| BackingStoreHashtable would use too much memory. The optimizer just
| works with row count estimates, so there is no guarantee that
| BackingStoreHashtable will not blow up. No customer has complained about
| this yet.
|
| I would like to work on this, changing BackingStoreHashtable to spill to
| disk when the hash table gets large, and changing the optimizer to
| understand that large hash tables are allowed, but more costly.
|
| The advantages of doing this are that
| 1. hash joins will not blow up when the compiler's row count estimate
| was low,
| 2. the optimizer can choose hash joins on large inputs when that is the
| least costly join implementation, and
| 3. we can use BackingStoreHashtable to implement features such as
| INTERSECT and GROUP BY, though I am not proposing to do so now.
|
| I am not proposing to implement a hash table that can be used as an
| alternative to Btrees for a secondary index. BackingStoreHashtable is
| used for transient data structures that are only accessed by a single
| thread. A secondary index implementation must deal with locking and must
| implement hashCode methods that are JVM independent. This is much more
| work and would yield a slower implementation.
|
| I propose that BackingStoreHashtable should start off using an in-memory
| hash table even if the estimated row count is large. That way it will
| use an in-memory hash table when the actual row count is small enough
| for the table to fit in memory. When it finds that spilling to disk is
| necessary BackingStoreHashtable will use the estimated row count to
| determine the initial number of buckets and move the in-memory entries
| to disk. The disk based hash table will use a linear hashing algorithm,
| see "External Memory Algorithms and Data Structures: Dealing withMassive
| Data", Jeffrey Scott Vitter, ACM Computing Surveys, Vol. 33, No. 2, June
| 2001, pp. 209–271. It grows the hash table one bucket at a time when the
| average number of entries per bucket grows beyond a threshold. The
| disadvantage of growing by a larger number of buckets is that the last
| expansion may be unnecessarily large, wasting time and space.
|
| I would appreciate it if anyone can point me to a better external hash
| table algorithm.
|
| The disk hash table will use overflow pages because an imperfect hash
| function will cause some buckets to get more than their share of
| entries, and because we may have duplicate keys.
|
| I have not yet looked at how the optimizer handles hash joins, so I do
| not yet have a proposal for how to change that.
|
| Can anyone explain how to generate a cost for the Derby optimizer? How
| do you compute it from an estimated number of disk accesses? Does an
| estimate of 2 disk accesses mean a cost of 2.0? Should the cost estimate
| include CPU time? If so, how do you estimate it relative to disk access
| cost? Putting it another way, what does a cost estimate of 1.0 (or x)
| mean?
|
| Comments, criticisms, questions?
|
| Jack Klebanoff
|
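
For what it's worth, below is a minimal in-memory sketch of the
one-bucket-at-a-time growth (linear hashing) described above. The class
and member names are made up for illustration; a real implementation
would keep the buckets on disk pages under the store layer rather than in
java.util lists, and would add the overflow pages mentioned above for
skewed buckets and duplicate keys:

import java.util.ArrayList;
import java.util.List;

// Minimal linear hashing sketch: the table grows by exactly one bucket
// whenever the average number of entries per bucket passes a threshold.
class LinearHashSketch {
    private static final int INITIAL_BUCKETS = 4;
    private static final double MAX_AVG_PER_BUCKET = 4.0;

    private final List<List<Object[]>> buckets = new ArrayList<>();
    private int level = 0;   // number of completed rounds of doubling
    private int next = 0;    // next bucket to split in the current round
    private long entryCount = 0;

    LinearHashSketch() {
        for (int i = 0; i < INITIAL_BUCKETS; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // Map a hash code to a bucket, using the pre-split modulus for buckets
    // that have not yet been split in the current round.
    private int bucketFor(int hash) {
        int h = hash & 0x7fffffff;
        int small = INITIAL_BUCKETS << level;   // table size at start of round
        int idx = h % small;
        if (idx < next) {
            idx = h % (small << 1);             // already split: larger modulus
        }
        return idx;
    }

    // Insert a (key, row) pair; duplicate keys simply share a bucket, as
    // they would share a bucket's pages on disk.
    void insert(Object key, Object[] row) {
        buckets.get(bucketFor(key.hashCode())).add(row);
        entryCount++;
        if ((double) entryCount / buckets.size() > MAX_AVG_PER_BUCKET) {
            splitOneBucket();
        }
    }

    // Grow by one bucket: redistribute the bucket at 'next' between itself
    // and a new bucket, then advance the split pointer.
    private void splitOneBucket() {
        int small = INITIAL_BUCKETS << level;
        List<Object[]> stay = new ArrayList<>();
        List<Object[]> move = new ArrayList<>();
        for (Object[] row : buckets.get(next)) {
            int h = keyOf(row).hashCode() & 0x7fffffff;
            if (h % (small << 1) == next) {
                stay.add(row);
            } else {
                move.add(row);                  // rehashes to next + small
            }
        }
        buckets.set(next, stay);
        buckets.add(move);                      // new bucket gets index next + small
        next++;
        if (next == small) {                    // round complete: table has doubled
            next = 0;
            level++;
        }
    }

    // Placeholder: in BackingStoreHashtable the key columns live inside
    // the row itself.
    private Object keyOf(Object[] row) {
        return row[0];
    }
}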
