couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jan Lehnardt <>
Subject Re: Understanding the CouchDB file format
Date Thu, 22 Dec 2011 09:21:57 GMT
Good writeup! Would you consider contributing it to the CouchDB Wiki?


On Dec 21, 2011, at 21:28 , Riyad Kalla wrote:

> Bob,
> Really appreciate the link; Rick has a handful of articles that helped a
> lot.
> Along side all the CouchDB reading I've been looking at SSD-optimized data
> storage mechanisms and tried to coalesce all of this information into this
> post on Couch's file storage format:
> It is uncanny how many things Couch seems to have gotten right with regard
> to existing storage systems and future flash-based storage systems. I'd
> appreciate any corrections, additions or feedback to the post for anyone
> interested.
> Best,
> R
> On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne <
>> wrote:
>> I think this is largely correct Riyad, I dug out an old article[1] by Rick
>> Ho that you may also find helpful though it might be slightly dated.
>> Generally the best performance will be had if the ids are sequential and
>> updates are done in bulk. Write heavy applications will eat up a lot of
>> space and require compaction. At the leaf nodes what are stored are either
>> full_doc_info records or doc_info records which store pointers to the data
>> so the main thing that impacts the branching at each level are the key size
>> and in the case of views the sizes of the reductions as these are stored
>> with the intermediate nodes.
>> All in all it works pretty well but as always you need to test and
>> evaluate it for you specific case to see what the limits are.
>> Regards,
>> Bob
>> [1]
>> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
>>> Adding to this conversation, I found this set of slides by Chris
>> explaining
>>> the append-only index update format:
>>> Specifically slides 16, 17 and 18.
>>> Using this example tree, rewriting the updated path (in reverse order)
>>> appended to the end of the file makes sense... you see how index queries
>>> can simply read backwards from the end of the file and not only find the
>>> latest revisions of docs, but also every other doc that wasn't touched
>> (it
>>> will just seek into the existing inner nodes of the b+ tree for
>> searching).
>>> What I am hoping for clarification on is the following pain-point that I
>>> perceive with this approach:
>>> 1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves
>>> to elements are short (typically no more than 3 to 5 levels deep) as
>>> opposed to a trie or some other construct that would have much longer
>> paths
>>> to elements.
>>> 2. Because the depth of the tree is so shallow, the breadth of it becomes
>>> large to compensate... more specifically, each internal node can have
>> 100s,
>>> 1000s or more children. Using the example slides, consider the nodes
>>> [A...M] and [R...Z] -- in a good sized CouchDB database, those internal
>>> index nodes would have 100s (or more) elements in them pointing at deeper
>>> internal nodes that themselves had thousands of elements; instead of the
>> 13
>>> or so as implied by [A...M].
>>> 3. Looking at slide 17 and 18, where you see the direct B+ tree path to
>> the
>>> update node getting appended to the end of the file after the revision is
>>> written (leaf to root ordering: [J' M] -> [A M] -> [A Z]) it implies that
>>> those internal nodes with *all* their child elements are getting
>> rewritten
>>> as well.
>>> In this example tree, it is isn't such a big issue... but in a
>> sufficiently
>>> large CouchDB database, these nodes denoted by [A...M] and [A...Z] could
>> be
>>> quite large... I don't know the format of the node elements in the B+
>> tree,
>>> but it would be whatever the size of a node is times however many
>> elements
>>> are contained at each level (1 for root, say 100 for level 2, 1000 for
>>> level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on
>>> here, of course it depends on the size of the data store).
>>> Am I missing something or is CouchDB really rewriting that much index
>>> information between document revisions on every update?
>>> What was previously confusing me is I thought it was *only* rewriting a
>>> direct path to the updated revision, like [B]>[E]>[J'] and Couch was
>>> some-how patching in that updated path info to the B+ index at runtime.
>>> If couch is rewriting entire node paths with all their elements then I am
>>> no longer confused about the B+ index updates, but am curious about the
>>> on-disk cost of this.
>>> In my own rough insertion testing, that would explain why I see my
>>> collections absolutely explode in size until they are compacted (not
>> using
>>> bulk insert, but intentionally doing single inserts for a million(s) of
>>> docs to see what kind of cost the index path duplication would be like).
>>> Can anyone confirm/deny/correct this assessment? I want to make sure I am
>>> on the right track understanding this.
>>> Best wishes,
>>> Riyad
>>> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <> wrote:
>>>> @Filipe - I was just not clear on how CouchDB operated; you and Robert
>>>> cleared that up for me. Thank you.
>>>> @Robert - The writeup is excellent so far (I am not familiar with
>> erlang,
>>>> so there is a bit of stickiness there), thank you for taking the time to
>>>> put this together!
>>>> At this point I am curious how the _id and _seq indices are read as
>> their
>>>> data is continually appended to the end of the data file in small
>>>> diff-trees for every updated doc.
>>>> If CouchDB kept all the indices in-memory and simply patched-in the
>>>> updated paths at runtime (maybe something akin to memory-mapped indices
>> in
>>>> MongoDB) I would be fairly clear on the operation... but as I understand
>>>> it, CouchDB keeps such a small memory footprint by doing no in-memory
>>>> caching and relying on the intelligence of the OS and filesystem (and/or
>>>> drives) to cache frequently accessed data.
>>>> I am trying to understand the logic used by CouchDB to answer a query
>>>> using the index once updates to the tree have been appended to the data
>>>> file... for example, consider a CouchDB datastore like the one Filipe
>>>> has... 10 million documents and let's say it is freshly compacted.
>>>> If I send in a request to that Couch instance, it hits the header of the
>>>> data file along with the index and walks the B+ tree to the leaf node,
>>>> where it finds the offset into the data file where the actual doc
>> lives...
>>>> let's say 1,000,000 bytes away.
>>>> These B+ trees are shallow, so it might look something like this:
>>>> Level 1: 1 node, root node.
>>>> Level 2: 100 nodes, inner child nodes
>>>> Level 3: 10,000 nodes, inner child nodes
>>>> Level 4: 1,000,000, leaf nodes... all with pointers to the data offsets
>> in
>>>> the data file.
>>>> Now let's say I write 10 updates to documents in that file. There are 10
>>>> new revisions appended to the end of the data file *each one* separated
>> by
>>>> a rewritten B+ path to a leaf node with it's new location at the end of
>> the
>>>> file. Each of those paths written between each doc revision (say
>> roughly 2k
>>>> like Filipe mentioned) are just 4 item paths... root -> level1 ->
>> level2 ->
>>>> level3 --> level4... showing the discrete path from the root to that
>>>> individual updated doc. The intermediary levels (l1, 2, 3) are not fully
>>>> flushed out with all the OTHER children from the original b+ tree index.
>>>> [[ is this correct so far? If not, please point out my mistakes...]
>>>> Now I issue a query for a document that WAS NOT updated...
>>>> **** this is where I get confused on the logic ***
>>>> this would mean I need to access the original B+ tree index at the root
>> of
>>>> the data file, because the revised B+ paths that are written between
>> each
>>>> of the updated doc revisions at the end of the file are not full
>> indices.
>>>> NOW consider I want to query for one of the changed docs... now I
>> suddenly
>>>> need to scan backwards from the data file's end to find the updated
>> path to
>>>> the new revision of that document.
>>>> (obviously) this isn't what Couch is actually doing... it's doing
>>>> something more elegant, I just can't figure out what or how and that is
>>>> what I was hoping for help with.
>>>> Much thanks guys, I know this is a heavy question to ask.
>>>> Best wishes,
>>>> R
>>>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <
>>>>> wrote:
>>>>> Robert Dionne
>>>>> Computer Programmer
>>>>> 203.231.9961
>>>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
>>>>>> Filipe,
>>>>>> Thank you for the reply.
>>>>>> Maybe I am misunderstanding exactly what couch is writing out; the
>> docs
>>>>>> I've read say that it "rewrites the root node" -- I can't tell if
>>>>> docs
>>>>>> mean the parent node of the child doc that was changed (as one of
>> b+
>>>>>> leaves) or if it means the direct path, from the root, to that
>>>>> individual
>>>>>> doc... or if it means the *entire* index...
>>>>>> In the case of even rewriting the single parent, with such a shallow
>>>>> tree,
>>>>>> each internal leaf will have a huge fan of nodes; let's say 1-10k
in a
>>>>>> decent sized data set.
>>>>>> If you are seeing a few K of extra written out after each changed
>>>>> then
>>>>>> that cannot be write... I almost assumed my understanding was wrong
>>>>> because
>>>>>> the sheer volume of data would make performance abysmal if it was
>> true.
>>>>>> Given that... is it just the changed path, from the root to the new
>> leaf
>>>>>> that is rewritten?
>>>>> Hi Riyad,
>>>>> You are correct, it's only the changed path. Interestingly I've just
>>>>> started to document all these internals[1] along with links to the
>> code and
>>>>> other references available.
>>>>> Cheers,
>>>>> Bob
>>>>> [1]
>>>>>> That makes me all sorts of curious as to how Couch
>>>>>> updates/searches the new modified index with the small diff that
>>>>> written
>>>>>> out.
>>>>>> Any pointers to reading that will help me dig down on this (even
>>>>> bugs
>>>>>> in JIRA?) would be appreciated. I've tried skimming back in 2007/08
>>>>>> Damien's blog to see if it wrote about it in depth and so far haven't
>>>>> found
>>>>>> anything as detailed as I am hoping for on this architecture.
>>>>>> Best,
>>>>>> Riyad
>>>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <
>>>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <>
>> wrote:
>>>>>>>> I've been reading everything I can find on the CouchDB file
>> format[1]
>>>>> and
>>>>>>>> am getting bits and pieces here and there, but not a great,
>> concrete,
>>>>>>>> step-by-step explanation of the process.
>>>>>>>> I'm clear on the use of B+ trees and after reading a few
papers on
>> the
>>>>>>>> benefits of log-structured file formats, I understand the
>> of
>>>>>>>> inlining the B+ tree indices directly into the data file
as well
>>>>>>> (locality
>>>>>>>> + sequential I/O)... what I'm flummoxed about is how much
of the B+
>>>>>>> tree's
>>>>>>>> index is rewritten after every modified document.
>>>>>>>> Consider a CouchDB file that looks more or less like this:
>>>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
>>>>>>>> After each revised doc is written and the "b-tree root" is
>>>>>>> after
>>>>>>>> that, is that just a modified root node of the B+ tree or
the entire
>>>>> B+
>>>>>>>> tree?
>>>>>>>> The reason I ask is because regardless of the answer to my
>>>>>>>> question, for a *huge* database will millions of records,
that seems
>>>>> like
>>>>>>>> an enormous amount of data to rewrite after every modification.
>>>>> the
>>>>>>>> root node had a fanning factor of 133; that would still be
alot of
>>>>> data
>>>>>>> to
>>>>>>>> rewrite.
>>>>>>> Hi Riyad,
>>>>>>> Have you observed that in practice?
>>>>>>> Typically the depth of database btrees is not that high even
>>>>>>> millions of documents. For example I have one around with about
>>>>>>> million documents which doesn't have more than 5 or 6 levels
if I
>>>>>>> recall correctly.
>>>>>>> So updating a doc, for that particular case, means rewriting
5 or 6
>>>>>>> new nodes plus the document itself. Each node is normally not
>>>>>>> bigger than 1.2Kb.
>>>>>>> I've written once a tool to analyze database files which reports
>> btree
>>>>>>> depths, however it's not updated to work with recent changes
>>>>>>> master/1.2.x such as snappy compression and btree sizes:
>>>>>>> It should work with CouchDB 1.1 (and older) database files.
>>>>>>>> I am certain I am missing the boat on this; if anyone can
pull me
>> out
>>>>> of
>>>>>>>> the water and point me to dry land I'd appreciate it.
>>>>>>>> Best,
>>>>>>>> R
>>>>>>>> [1]
>>>>>>>> --
>>>>>>>> --
>>>>>>>> --
>>>>>>>> --
>>>>>>>> --* (Over my head)
>>>>>>> --
>>>>>>> Filipe David Manana,
>>>>>>> "Reasonable men adapt themselves to the world.
>>>>>>> Unreasonable men adapt the world to themselves.
>>>>>>> That's why all progress depends on unreasonable men."

View raw message