couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Riyad Kalla <>
Subject Re: How are key values stored in B+ tree nodes on-disk?
Date Tue, 24 Jan 2012 03:28:57 GMT
<resending, Gmail is telling me this never sent...>

Randall and Paul, I appreciate the explanation and links to the exact spots
in code. This helped a lot.

I've been reading up on alternative database impls; InnoDB, for example,
caps the key length at some upper bound and then truncates the key and puts
the remainder in an overflow table; naturally this requires 1 more random
I/O every time you match a long key. At the extreme case (where every key
doesn't fit and parts of it are in the overflow table) I think you end up
with key-table idea you were toying with Paul, or something related.

Paul, I feel your pain re: alternative approaches to this. If you can do
away with RANGE queries, moving to MD5 or SHA hashes (or something even
faster like MurmurHash) is a nice way to get consistently sized keys which
gives you the advantage of paging in a block in raw bytes and only
deserializing the 16 or 20-byte chunks in the positions of a standard
binary-search until you find what you want instead of the whole node.

I *really* want to find a way to make that work, but people seem to like
RANGE queries.

I asked this question on SO and a gentleman suggested a header at the
beginning of the block that actually lists the offsets of all the keys in
the node. For node/block sizes of 32KB of smaller (which seem reasonable),
you could safely get away with a 16-bit int for storing the offsets;
assuming a typical key size, I imagine you wouldn't have more than 100 or
200 keys in a node; assume 150, that is 300 bytes worth of offsets at the
beginning of the block. So we spend 7% (assuming 4096kb block size) of the
block providing offsets to help make key searching faster. If you were
willing to cap blocks to no more than 255 keys you could use 8-bit offset
values (meaning any key smaller than 8 bytes would get you in trouble).

A limitation to this idea though is that when you need to overflow a single
block because you are indexing so many or such large keys, you now end up
with two blocks next to each other with 1 of them having no header.

You can fix this by having a hard-set-block-size for the DB (say 4KB) that
never changes and can never be configured. That way every block,
everywhere, has a very tiny "type" header; maybe 8-bit. If the "normal" bit
is set, then that block would have an offset list at the beginning of it
before the key data. If the "overflow" bit was set, than that block is a
continuation of the block before it, and right after the 8-bit block header
is raw key data that needs to be considered.

Just an idea, I am sure there are more elegant ways to solve this. I am
more curious if there is something here... something interesting and worth
considering, or if I am just enjoying the smell of my own farts over here ;)


On Mon, Jan 23, 2012 at 1:39 AM, Paul Davis <>wrote:

> On Mon, Jan 23, 2012 at 2:26 AM, Randall Leeds <>
> wrote:
> > On Sun, Jan 22, 2012 at 04:49, Riyad Kalla <> wrote:
> >> In my ever-growing need to ask the most esoteric file-format questions
> >> possible, I am curious if anyone can point me in the direction of the
> >> answer here...
> >>
> >> Background
> >> -----------------
> >> Consider the _id and _seq B+ indices whose change-paths are appended to
> the
> >> end of the Couch data file after every update. Each of the nodes written
> >> have a basic structure of [key, value] where the "key" is the _id or
> _seq
> >> value being indexed, and the "value" is an offset pointer to the next
> node
> >> in the tree to seek to (or in the case of a leaf, an offset directly
> into
> >> the datafile where the updated record lives).
> >>
> >> General
> >> -----------
> >> Efficiently searching the keys stored in a node of a B+ tree after it
> has
> >> been paged in require (let's assume) that the values are in sorted
> order;
> >> then the keys can be binary-searched to find the next offset or block ID
> >> on-disk to jump to.
> >>
> >> To be able to efficiently jump around key values like that, they must
> all
> >> be of the same size. For example, let's say we store 133 keys per node
> >> (assume it is full). That means our first comparison after paging in
> that
> >> block should be at the 67th key. The only way I can jump to that
> position
> >> is to know every key value is say 16 bytes and then jump forward 67 *
> 16 =
> >> 1072 bytes before decoding the key from raw bytes and comparing.
> >>
> >> Question
> >> -------------
> >> Given the massive potential for document _id values to vary in size, it
> >> seems impossible to have a hard-coded "key size" that couch utilizes
> here.
> >> Even something sufficiently large like 32 or 64 bytes would still run
> into
> >> situations where a longer key was being indexed or a *very short* key is
> >> being indexed and there are multiple magnitudes of wasted space in the
> >> index.
> >>
> >> The few solutions I've found to variable-length keys online require a
> >> sequential processing of the keys in a node because the keys will be
> >> written like:
> >> [key length][key data][offset]
> >>
> >> This is certainly doable, but less than optimal.
> >>
> >>
> >> I am curious how CouchDB handles this?
> >>
> >> Thank you for any pointers.
> >>
> >> -R
> >
> > The entire node is read into memory and deserialized from the Erlang
> > term storage format into a list.
> > The result is still a typical tree search, in which CouchDB can scan
> > an inner nodes key pointers to determine which child(ren) (if any)
> > might have a particular key, but within a particular inner node that
> > scan is linear.
> >
> > See here:
> >
> > -Randall
> Basically, yes.
> You're quite right. Most of the literature on these things either hand
> waves variable key lengths, or spends an entire thesis on how to
> encode them using prefix elision with tries or other fancy approaches.
> One of the deviations from literature that CouchDB has is that it
> doesn't do fancy things here. It just looks at the binary size of a
> node and splits on that. So for instance, we don't have a set number
> of key's per node like a traditional b+tree. And the pedants will
> basically say, "its no longer a b+tree" at this point which is true.
> I've always referred to our "b+tree" implementation as more of a
> "b~tree" implementation.
> Although I've also spent a good deal of time considering various
> alternative implementations that bring us back to a traditional b+tree
> implementation (ie, storing each key on disk, and then each node
> stores the offset of the key) but these experiments have invariably
> ended up performing considerably worse in some crappy benchmarks I
> wrote at the time.
> In summary, its kind of a dirty engineering approach as opposed to a
> beautiful theoretical approach though there's not too much in terms of
> beautiful theory on variable sized key/values in b+trees.
> Also, to add to Ranall's description, when we go to *write* a node, we
> basically, serialize incrementally and then squeeze off a new node
> when a threshold is met. Ie, if you were to be silly and update 1M
> nodes at a time, we basically start popping off KVs until they exceed
> a single node's threshold, then write, and then move on and
> recursively build out the tree.
> If you're really interested in the heart of the matter, the single
> function you need to look at is:
> Also, the comment is quite informative on the situation.

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message