couchdb-dev mailing list archives

From Paul Davis <>
Subject Re: How are key values stored in B+ tree nodes on-disk?
Date Tue, 24 Jan 2012 03:49:37 GMT
On Mon, Jan 23, 2012 at 9:28 PM, Riyad Kalla <> wrote:
> <resending, Gmail is telling me this never sent...>
> Randall and Paul, I appreciate the explanation and links to the exact spots
> in code. This helped a lot.
> I've been reading up on alternative database impls; InnoDB, for example,
> caps the key length at some upper bound and then truncates the key and puts
> the remainder in an overflow table; naturally this requires 1 more random
> I/O every time you match a long key. At the extreme case (where every key
> doesn't fit and parts of it are in the overflow table) I think you end up
> with the key-table idea you were toying with, Paul, or something related.
> Paul, I feel your pain re: alternative approaches to this. If you can do
> away with RANGE queries, moving to MD5 or SHA hashes (or something even
> faster like MurmurHash) is a nice way to get consistently sized keys which
> gives you the advantage of paging in a block in raw bytes and only
> deserializing the 16 or 20-byte chunks in the positions of a standard
> binary-search until you find what you want instead of the whole node.
> I *really* want to find a way to make that work, but people seem to like
> RANGE queries.
> I asked this question on SO and a gentleman suggested a header at the
> beginning of the block that actually lists the offsets of all the keys in
> the node. For node/block sizes of 32KB or smaller (which seem reasonable),
> you could safely get away with a 16-bit int for storing the offsets;
> assuming a typical key size, I imagine you wouldn't have more than 100 or
> 200 keys in a node; assume 150, that is 300 bytes worth of offsets at the
> beginning of the block. So we spend 7% (assuming a 4096-byte block size) of the
> block providing offsets to help make key searching faster. If you were
> willing to cap blocks to no more than 255 keys you could use 8-bit offset
> values (though 8-bit offsets can only address the first 256 bytes of a
> block, so that only works if the total key data is tiny).
> A limitation to this idea though is that when you need to overflow a single
> block because you are indexing so many or such large keys, you now end up
> with two blocks next to each other with 1 of them having no header.
> You can fix this by having a hard-set-block-size for the DB (say 4KB) that
> never changes and can never be configured. That way every block,
> everywhere, has a very tiny "type" header; maybe 8-bit. If the "normal" bit
> is set, then that block would have an offset list at the beginning of it
> before the key data. If the "overflow" bit was set, than that block is a
> continuation of the block before it, and right after the 8-bit block header
> is raw key data that needs to be considered.
> Just an idea, I am sure there are more elegant ways to solve this. I am
> more curious if there is something here... something interesting and worth
> considering, or if I am just enjoying the smell of my own farts over here ;)
> -R
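
To make the fixed-size-key idea from the quoted message concrete, here is a
rough sketch (in Python, with MD5 standing in for whatever hash you'd pick,
and with membership rather than value lookup for brevity): because every
digest is exactly 16 bytes, you can binary-search the raw node bytes and only
ever materialize the 16-byte slices the search actually probes.

```python
import hashlib

DIGEST_SIZE = 16  # MD5 digests are a fixed 16 bytes


def build_node(keys):
    """Serialize a node as a sorted run of fixed-size key digests."""
    digests = sorted(hashlib.md5(k).digest() for k in keys)
    return b"".join(digests)


def node_contains(node, key):
    """Binary-search the raw block; only DIGEST_SIZE-byte slices are examined,
    never the whole node."""
    target = hashlib.md5(key).digest()
    lo, hi = 0, len(node) // DIGEST_SIZE
    while lo < hi:
        mid = (lo + hi) // 2
        probe = node[mid * DIGEST_SIZE:(mid + 1) * DIGEST_SIZE]
        if probe == target:
            return True
        if probe < target:
            lo = mid + 1
        else:
            hi = mid
    return False
```

The obvious cost is the one the message already notes: hashed keys destroy
ordering, so range queries are gone.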
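
The offset-header idea can be sketched like this; the exact layout here (a
little-endian 16-bit key count followed by one 16-bit offset per key) is just
one illustrative choice, not anything CouchDB actually does. Variable-length
keys stay sorted and range-searchable, and a lookup still only slices out the
keys the binary search touches:

```python
import struct


def pack_node(keys):
    """Serialize sorted variable-length keys behind a 16-bit offset table.
    Header: key count, then one offset per key, all unsigned 16-bit ints."""
    keys = sorted(keys)
    header_len = 2 + 2 * len(keys)
    offsets, pos = [], header_len
    for k in keys:
        offsets.append(pos)
        pos += len(k)
    header = struct.pack("<H", len(keys))
    header += struct.pack("<%dH" % len(keys), *offsets)
    return header + b"".join(keys)


def get_key(block, i, n):
    """Slice out key i via the offset table; only that key is materialized."""
    start = struct.unpack_from("<H", block, 2 + 2 * i)[0]
    if i + 1 < n:
        end = struct.unpack_from("<H", block, 2 + 2 * (i + 1))[0]
    else:
        end = len(block)
    return block[start:end]


def contains(block, key):
    """Binary search over the offset table, touching O(log n) keys."""
    n = struct.unpack_from("<H", block, 0)[0]
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        probe = get_key(block, mid, n)
        if probe == key:
            return True
        if probe < key:
            lo = mid + 1
        else:
            hi = mid
    return False
```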
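
And the fixed-size block with a tiny type header might look roughly like the
following; the NORMAL/OVERFLOW tag values and the 4KB size are made up for
illustration. A payload that doesn't fit in one block is chained into
continuation blocks, and a reader can tell the two kinds apart from the first
byte alone:

```python
BLOCK_SIZE = 4096
NORMAL, OVERFLOW = 0x00, 0x01  # hypothetical type-byte values


def split_into_blocks(payload):
    """Chop a node payload into fixed-size blocks. The first block is tagged
    NORMAL (it would carry the offset header); each continuation is tagged
    OVERFLOW and holds raw key data only."""
    body = BLOCK_SIZE - 1  # one byte reserved for the type header
    blocks = []
    for i in range(0, len(payload), body):
        tag = NORMAL if i == 0 else OVERFLOW
        blocks.append(bytes([tag]) + payload[i:i + body])
    return blocks or [bytes([NORMAL])]


def reassemble(blocks):
    """Concatenate the payloads of a NORMAL block and its OVERFLOW chain."""
    assert blocks[0][0] == NORMAL
    assert all(b[0] == OVERFLOW for b in blocks[1:])
    return b"".join(b[1:] for b in blocks)
```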

I would say that the important constraint in changing this is that
you can't change the existing behavior without a *very* good reason.
For instance, there's currently no limit on the key length in the
btree. Changing this (even to something "long" like 65535 bytes) is
still a behavior change. And removing range queries is pretty much out
without changing a major API endpoint.

That said, I invite you to play with this code. I've spent a good deal
of time trying to make it better but kept coming up short when
measuring performance (and it's possible that my measurements weren't
completely representative). If you come up with anything, give me a
holler and I'd love to go through it with you.
