couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Riyad Kalla <>
Subject How are key values stored in B+ tree nodes on-disk?
Date Sun, 22 Jan 2012 12:49:12 GMT
In my ever-growing need to ask the most esoteric file-format questions
possible, I am curious if anyone can point me in the direction of the
answer here...

Consider the _id and _seq B+ indices whose change-paths are appended to the
end of the Couch data file after every update. Each of the nodes written
have a basic structure of [key, value] where the "key" is the _id or _seq
value being indexed, and the "value" is an offset pointer to the next node
in the tree to seek to (or in the case of a leaf, an offset directly into
the datafile where the updated record lives).

Efficiently searching the keys stored in a node of a B+ tree after it has
been paged in require (let's assume) that the values are in sorted order;
then the keys can be binary-searched to find the next offset or block ID
on-disk to jump to.

To be able to efficiently jump around key values like that, they must all
be of the same size. For example, let's say we store 133 keys per node
(assume it is full). That means our first comparison after paging in that
block should be at the 67th key. The only way I can jump to that position
is to know every key value is say 16 bytes and then jump forward 67 * 16 =
1072 bytes before decoding the key from raw bytes and comparing.

Given the massive potential for document _id values to vary in size, it
seems impossible to have a hard-coded "key size" that couch utilizes here.
Even something sufficiently large like 32 or 64 bytes would still run into
situations where a longer key was being indexed or a *very short* key is
being indexed and there are multiple magnitudes of wasted space in the

The few solutions I've found to variable-length keys online require a
sequential processing of the keys in a node because the keys will be
written like:
[key length][key data][offset]

This is certainly doable, but less than optimal.

I am curious how CouchDB handles this?

Thank you for any pointers.


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message