couchdb-dev mailing list archives

From Paul Davis <>
Subject Re: How are key values stored in B+ tree nodes on-disk?
Date Mon, 23 Jan 2012 08:39:50 GMT
On Mon, Jan 23, 2012 at 2:26 AM, Randall Leeds <> wrote:
> On Sun, Jan 22, 2012 at 04:49, Riyad Kalla <> wrote:
>> In my ever-growing need to ask the most esoteric file-format questions
>> possible, I am curious if anyone can point me in the direction of the
>> answer here...
>> Background
>> -----------------
>> Consider the _id and _seq B+ indices whose change-paths are appended to the
>> end of the Couch data file after every update. Each of the nodes written
>> has a basic structure of [key, value], where the "key" is the _id or _seq
>> value being indexed, and the "value" is an offset pointer to the next node
>> in the tree to seek to (or in the case of a leaf, an offset directly into
>> the datafile where the updated record lives).
>> General
>> -----------
>> Efficiently searching the keys stored in a node of a B+ tree after it has
>> been paged in requires (let's assume) that the key values are in sorted order;
>> then the keys can be binary-searched to find the next offset or block ID
>> on-disk to jump to.
>> To be able to efficiently jump around key values like that, they must all
>> be of the same size. For example, let's say we store 133 keys per node
>> (assume it is full). That means our first comparison after paging in that
>> block should be at the 67th key. The only way I can jump to that position
>> is to know every key value is, say, 16 bytes and then jump forward 67 * 16 =
>> 1072 bytes before decoding the key from raw bytes and comparing.
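(To illustrate the fixed-size-key case being described: a rough Erlang
sketch, where the 16-byte key size, module name, and packed-keys layout
are all invented for the example and are not anything CouchDB does.)

    %% fixed_key_node.erl -- hypothetical node of N fixed-size keys packed
    %% back to back in one binary; lookup can jump straight to any key.
    -module(fixed_key_node).
    -export([lookup/2]).

    -define(KEY_SIZE, 16).

    lookup(NodeBin, Key) when byte_size(Key) =:= ?KEY_SIZE ->
        NumKeys = byte_size(NodeBin) div ?KEY_SIZE,
        search(NodeBin, Key, 0, NumKeys - 1).

    search(_NodeBin, _Key, Lo, Hi) when Lo > Hi ->
        not_found;
    search(NodeBin, Key, Lo, Hi) ->
        Mid = (Lo + Hi) div 2,
        %% Because every key is exactly ?KEY_SIZE bytes, the Mid-th key
        %% starts at byte offset Mid * ?KEY_SIZE -- no sequential scan.
        Candidate = binary:part(NodeBin, Mid * ?KEY_SIZE, ?KEY_SIZE),
        if
            Candidate =:= Key -> {found, Mid};
            Candidate < Key   -> search(NodeBin, Key, Mid + 1, Hi);
            true              -> search(NodeBin, Key, Lo, Mid - 1)
        end.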
>> Question
>> -------------
>> Given the massive potential for document _id values to vary in size, it
>> seems impossible to have a hard-coded "key size" that couch utilizes here.
>> Even something sufficiently large like 32 or 64 bytes would still run into
>> situations where a longer key was being indexed or a *very short* key is
>> being indexed and there are multiple magnitudes of wasted space in the
>> index.
>> The few solutions I've found online for variable-length keys require
>> sequential processing of the keys in a node, because the keys will be
>> written like:
>> [key length][key data][offset]
>> This is certainly doable, but less than optimal.
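(And, for contrast, a hypothetical Erlang sketch of that
[key length][key data][offset] layout; the field widths and module name
are invented, and since key sizes vary the scan cannot skip ahead.)

    %% var_key_node.erl -- made-up layout: 16-bit key length, the key
    %% bytes, then a 64-bit offset; none of this is CouchDB's format.
    -module(var_key_node).
    -export([lookup/2]).

    lookup(<<KeyLen:16, Key:KeyLen/binary, Offset:64, Rest/binary>>, Wanted) ->
        case Key of
            Wanted -> {found, Offset};
            _      -> lookup(Rest, Wanted)   %% no way to jump ahead
        end;
    lookup(<<>>, _Wanted) ->
        not_found.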
>> I am curious how CouchDB handles this?
>> Thank you for any pointers.
>> -R
> The entire node is read into memory and deserialized from the Erlang
> term storage format into a list.
> The result is still a typical tree search, in which CouchDB can scan
> an inner node's key pointers to determine which child(ren) (if any)
> might have a particular key, but within a particular inner node that
> scan is linear.
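(Roughly what that looks like, under assumed shapes; the tuple tag and
the meaning of the keys below are guesses for illustration, not the real
couch_btree terms.)

    -module(node_scan).
    -export([find_child/2]).

    find_child(NodeBin, Key) ->
        %% The whole node is decoded in one go...
        {kp_node, KPs} = binary_to_term(NodeBin),
        %% ...and then scanned linearly: take the first child whose
        %% (assumed) last key is >= the key we are looking for.
        case lists:dropwhile(fun({LastKey, _Ptr}) -> LastKey < Key end, KPs) of
            [{_LastKey, Ptr} | _] -> {ok, Ptr};
            []                    -> not_found
        end.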
> See here:
> -Randall

Basically, yes.

You're quite right. Most of the literature on these things either
hand-waves variable key lengths or spends an entire thesis on how to
encode them using prefix elision with tries or other fancy approaches.

One of the ways CouchDB deviates from the literature is that it
doesn't do anything fancy here. It just looks at the binary size of a
node and splits on that. So, for instance, we don't have a set number
of keys per node like a traditional b+tree. And the pedants will
basically say "it's no longer a b+tree" at this point, which is true.
I've always referred to our "b+tree" implementation as more of a
"b~tree" implementation.

I've also spent a good deal of time considering various alternative
implementations that bring us back to a traditional b+tree (i.e.,
storing each key on disk and then having each node store the offset of
the key), but these experiments have invariably ended up performing
considerably worse in some crappy benchmarks I wrote at the time.

In summary, it's kind of a dirty engineering approach as opposed to a
beautiful theoretical approach, though there's not too much in the way
of beautiful theory on variable-sized keys/values in b+trees.

Also, to add to Randall's description, when we go to *write* a node, we
basically serialize incrementally and then squeeze off a new node
when a threshold is met. I.e., if you were to be silly and update 1M
KVs at a time, we basically start popping off KVs until they exceed
a single node's threshold, then write, and then move on and
recursively build out the tree.
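A hand-wavy sketch of that bottom-up pass, tacked onto the size_split
sketch above; write_node/1 here is just a stand-in for appending a node
to the file and getting back a pointer to it:

    %% Chunk the sorted KVs into nodes, write each one, keep one
    %% {LastKeyInChild, ChildPointer} pair per node, and recurse on those
    %% pairs until a single root node remains.
    build(KVs, Threshold) ->
        Chunks = chunkify(KVs, Threshold),
        Pointers = [{last_key(Chunk), write_node(Chunk)} || Chunk <- Chunks],
        case Pointers of
            [Root] -> Root;                       %% down to one node: done
            _      -> build(Pointers, Threshold)  %% build the next level up
        end.

    last_key(Chunk) ->
        element(1, lists:last(Chunk)).

    write_node(Chunk) ->
        erlang:phash2(Chunk).  %% fake pointer, just for the sketch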

If you're really interested in the heart of the matter, the single
function you need to look at is:

The comment on that function is also quite informative about the situation.
