couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Filipe David Manana <>
Subject Re: Understanding the CouchDB file format
Date Tue, 20 Dec 2011 20:37:30 GMT
On Tue, Dec 20, 2011 at 8:27 PM, Riyad Kalla <> wrote:
> Filipe,
> Thank you for the reply.
> Maybe I am misunderstanding exactly what couch is writing out; the docs
> I've read say that it "rewrites the root node" -- I can't tell if the docs
> mean the parent node of the child doc that was changed (as one of the b+
> leaves) or if it means the direct path, from the root, to that individual
> doc... or if it means the *entire* index...
> In the case of even rewriting the single parent, with such a shallow tree,
> each internal leaf will have a huge fan of nodes; let's say 1-10k in a
> decent sized data set.

The leaf nodes don't contain the documents, but rather pointers (file
offsets) to where the documents are stored in the file.
Non-leaf nodes contain a list of pointers to child nodes.

Updating a document means changing a value in its leaf node, writing
the modified version to disk, than update its parent node to refer to
the offset of the new leaf node, the parent of the parent, etc until
you write a new root node.

Maybe I misunderstood your doubt, but that doesn't seem a lot to me
(remember each node is typically about 1.2K).

However a doc update implies updating 2 btrees (seq and id trees).

> If you are seeing a few K of extra written out after each changed doc then
> that cannot be write... I almost assumed my understanding was wrong because
> the sheer volume of data would make performance abysmal if it was true.
> Given that... is it just the changed path, from the root to the new leaf
> that is rewritten? That makes me all sorts of curious as to how Couch
> updates/searches the new modified index with the small diff that is written
> out.
> Any pointers to reading that will help me dig down on this (even early bugs
> in JIRA?) would be appreciated. I've tried skimming back in 2007/08 on
> Damien's blog to see if it wrote about it in depth and so far haven't found
> anything as detailed as I am hoping for on this architecture.

Just read couch_btree.erl, couch_file.erl and couch_db_updater.erl.

> Best,
> Riyad
> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <>wrote:
>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <> wrote:
>> > I've been reading everything I can find on the CouchDB file format[1] and
>> > am getting bits and pieces here and there, but not a great, concrete,
>> > step-by-step explanation of the process.
>> >
>> > I'm clear on the use of B+ trees and after reading a few papers on the
>> > benefits of log-structured file formats, I understand the benefits of
>> > inlining the B+ tree indices directly into the data file as well
>> (locality
>> > + sequential I/O)... what I'm flummoxed about is how much of the B+
>> tree's
>> > index is rewritten after every modified document.
>> >
>> > Consider a CouchDB file that looks more or less like this:
>> >
>> > [idx/header][doc1, rev1][idx/header][doc1, rev2]....
>> >
>> > After each revised doc is written and the "b-tree root" is rewritten
>> after
>> > that, is that just a modified root node of the B+ tree or the entire B+
>> > tree?
>> >
>> > The reason I ask is because regardless of the answer to my previous
>> > question, for a *huge* database will millions of records, that seems like
>> > an enormous amount of data to rewrite after every modification. Say the
>> > root node had a fanning factor of 133; that would still be alot of data
>> to
>> > rewrite.
>> Hi Riyad,
>> Have you observed that in practice?
>> Typically the depth of database btrees is not that high even for
>> millions of documents. For example I have one around with about 10
>> million documents which doesn't have more than 5 or 6 levels if I
>> recall correctly.
>> So updating a doc, for that particular case, means rewriting 5 or 6
>> new nodes plus the document itself. Each node is normally not much
>> bigger than 1.2Kb.
>> I've written once a tool to analyze database files which reports btree
>> depths, however it's not updated to work with recent changes on
>> master/1.2.x such as snappy compression and btree sizes:
>> It should work with CouchDB 1.1 (and older) database files.
>> >
>> > I am certain I am missing the boat on this; if anyone can pull me out of
>> > the water and point me to dry land I'd appreciate it.
>> >
>> > Best,
>> > R
>> >
>> >
>> >
>> > [1]
>> > --
>> >
>> > --
>> > --
>> > --
>> > --* (Over my head)
>> --
>> Filipe David Manana,
>> "Reasonable men adapt themselves to the world.
>>  Unreasonable men adapt the world to themselves.
>>  That's why all progress depends on unreasonable men."

Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

View raw message