Subject: Re: Understanding the CouchDB file format
From: Robert Dionne <dionne@dionne-associates.com>
Date: Wed, 21 Dec 2011 14:53:05 -0500
To: dev@couchdb.apache.org

I think this is largely correct, Riyad. I dug out an old article[1] by Ricky Ho
that you may also find helpful, though it might be slightly dated. Generally the
best performance will be had if the ids are sequential and updates are done in
bulk. Write-heavy applications will eat up a lot of space and require compaction.
At the leaf nodes, what is stored are either full_doc_info or doc_info records,
which hold pointers to the data. So the main things that impact the branching at
each level are the key size and, in the case of views, the sizes of the
reductions, as these are stored with the intermediate nodes.

All in all it works pretty well, but as always you need to test and evaluate it
for your specific case to see what the limits are.
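To make that a bit more concrete, here is a rough sketch of why key size drives
the branching factor. It's Python rather than the actual Erlang, and the field
names and the 40-byte overhead figure are purely illustrative, not the real
full_doc_info/doc_info record layout:

# Illustrative only: roughly what a by-id leaf entry carries, and how key size
# limits how many entries fit in a node of a given size budget.
from dataclasses import dataclass

@dataclass
class LeafEntry:                 # loosely analogous to a doc_info record
    key: bytes                   # the document _id
    update_seq: int              # sequence number of the last update
    rev: str                     # winning revision
    body_offset: int             # file offset where the doc body lives
    deleted: bool = False

FIXED_OVERHEAD = 40              # assumed bytes per entry for seq/rev/offset/flags

def entries_per_node(node_budget_bytes, avg_key_bytes):
    """How many entries fit in one node, under the assumed size budget."""
    return node_budget_bytes // (avg_key_bytes + FIXED_OVERHEAD)

if __name__ == "__main__":
    for key_size in (16, 64, 256):
        print(key_size, "byte keys ->", entries_per_node(1200, key_size),
              "entries per ~1.2 KB node")

Small sequential ids give you wide nodes and a shallower tree; large keys (or
large view reductions, which live in the inner nodes) shrink the fan-out.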
Regards,

Bob

[1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html

On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:

> Adding to this conversation, I found this set of slides by Chris explaining
> the append-only index update format:
> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
>
> Specifically slides 16, 17 and 18.
>
> Using this example tree, rewriting the updated path (in reverse order)
> appended to the end of the file makes sense... you can see how index queries
> can simply read backwards from the end of the file and not only find the
> latest revisions of docs, but also every other doc that wasn't touched (it
> will just seek into the existing inner nodes of the B+ tree for searching).
>
> What I am hoping for clarification on is the following pain point that I
> perceive with this approach:
>
> 1. In a sufficiently shallow B+ tree (like CouchDB's), the paths to elements
> are short (typically no more than 3 to 5 levels deep), as opposed to a trie
> or some other construct that would have much longer paths to elements.
>
> 2. Because the depth of the tree is so shallow, its breadth becomes large to
> compensate... more specifically, each internal node can have 100s, 1000s or
> more children. Using the example slides, consider the nodes [A...M] and
> [R...Z] -- in a good-sized CouchDB database, those internal index nodes
> would have 100s (or more) elements in them pointing at deeper internal
> nodes that themselves had thousands of elements, instead of the 13 or so
> implied by [A...M].
>
> 3. Looking at slides 17 and 18, where you see the direct B+ tree path to the
> updated node getting appended to the end of the file after the revision is
> written (leaf-to-root ordering: [J' M] -> [A M] -> [A Z]), it implies that
> those internal nodes, with *all* their child elements, are getting
> rewritten as well.
>
> In this example tree it isn't such a big issue... but in a sufficiently
> large CouchDB database, the nodes denoted by [A...M] and [A...Z] could be
> quite large... I don't know the format of the node elements in the B+ tree,
> but it would be whatever the size of a node is times however many elements
> are contained at each level (1 for root, say 100 for level 2, 1,000 for
> level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on
> here; of course it depends on the size of the data store).
>
> Am I missing something, or is CouchDB really rewriting that much index
> information between document revisions on every update?
>
> What was previously confusing me is that I thought it was *only* rewriting a
> direct path to the updated revision, like [B]>[E]>[J'], and Couch was
> somehow patching that updated path info into the B+ index at runtime.
>
> If Couch is rewriting entire node paths with all their elements, then I am
> no longer confused about the B+ index updates, but am curious about the
> on-disk cost of this.
>
> In my own rough insertion testing, that would explain why I see my
> collections absolutely explode in size until they are compacted (not using
> bulk insert, but intentionally doing single inserts for millions of docs to
> see what kind of cost the index path duplication would be like).
>
> Can anyone confirm/deny/correct this assessment? I want to make sure I am
> on the right track understanding this.
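Your reading of the slides matches my understanding: each node on the changed
path is appended again in full, child pointers and all; what is not rewritten
is any subtree the path doesn't touch. A rough back-of-the-envelope sketch of
what one single-document update appends (Python, and all of the sizes are
assumptions for illustration, not measured CouchDB internals):

# Rough cost model for one single-document update in an append-only B+ tree.
# Every node on the root-to-leaf path is rewritten whole and appended, so the
# cost is roughly depth * node_size plus the new revision itself.
def bytes_appended_per_update(doc_bytes, depth, node_bytes):
    return doc_bytes + depth * node_bytes

if __name__ == "__main__":
    # Illustrative figures: ~1 KB doc, 4-6 levels, ~1.2 KB per rewritten node.
    for depth in (4, 5, 6):
        total = bytes_appended_per_update(1024, depth, 1200)
        print(f"depth {depth}: ~{total} bytes appended per update")

Which is why single-document writes inflate the file so quickly, and why bulk
updates (which share one path rewrite across many docs in a single commit) are
so much cheaper, until compaction rewrites everything once.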
>
> Best wishes,
> Riyad
>
> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla wrote:
>
>> @Filipe - I was just not clear on how CouchDB operated; you and Robert
>> cleared that up for me. Thank you.
>>
>> @Robert - The writeup is excellent so far (I am not familiar with Erlang,
>> so there is a bit of stickiness there); thank you for taking the time to
>> put this together!
>>
>> At this point I am curious how the _id and _seq indices are read, as their
>> data is continually appended to the end of the data file in small
>> diff-trees for every updated doc.
>>
>> If CouchDB kept all the indices in memory and simply patched in the
>> updated paths at runtime (maybe something akin to memory-mapped indices in
>> MongoDB), I would be fairly clear on the operation... but as I understand
>> it, CouchDB keeps such a small memory footprint by doing no in-memory
>> caching and relying on the intelligence of the OS and filesystem (and/or
>> drives) to cache frequently accessed data.
>>
>> I am trying to understand the logic used by CouchDB to answer a query
>> using the index once updates to the tree have been appended to the data
>> file... for example, consider a CouchDB datastore like the one Filipe
>> has... 10 million documents, and let's say it is freshly compacted.
>>
>> If I send in a request to that Couch instance, it hits the header of the
>> data file along with the index and walks the B+ tree to the leaf node,
>> where it finds the offset into the data file where the actual doc lives...
>> let's say 1,000,000 bytes away.
>>
>> These B+ trees are shallow, so it might look something like this:
>>
>> Level 1: 1 node, the root node.
>> Level 2: 100 nodes, inner child nodes.
>> Level 3: 10,000 nodes, inner child nodes.
>> Level 4: 1,000,000 leaf nodes... all with pointers to the data offsets in
>> the data file.
>>
>> Now let's say I write 10 updates to documents in that file. There are 10
>> new revisions appended to the end of the data file, *each one* followed by
>> a rewritten B+ path to a leaf node with its new location at the end of the
>> file. Each of the paths written between the doc revisions (say roughly 2k
>> like Filipe mentioned) is just a 4-item path... root (level 1) -> level 2
>> -> level 3 -> level 4... showing the discrete path from the root to that
>> individual updated doc. The intermediate levels (1, 2, 3) are not fully
>> fleshed out with all the OTHER children from the original B+ tree index.
>>
>> [[ Is this correct so far? If not, please point out my mistakes... ]]
>>
>> Now I issue a query for a document that WAS NOT updated...
>>
>> **** this is where I get confused on the logic ****
>>
>> This would mean I need to access the original B+ tree index at the root of
>> the data file, because the revised B+ paths that are written between each
>> of the updated doc revisions at the end of the file are not full indices.
>>
>> NOW consider I want to query for one of the changed docs... now I suddenly
>> need to scan backwards from the data file's end to find the updated path to
>> the new revision of that document.
>>
>> (Obviously) this isn't what Couch is actually doing... it's doing
>> something more elegant, I just can't figure out what or how, and that is
>> what I was hoping for help with.
>>
>> Much thanks guys, I know this is a heavy question to ask.
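The missing piece here is the header: every committed update ends with a new
root node and then a new header that records that root's file offset. A query
never scans backwards through the appended paths; it starts from the root
referenced by the most recent header, and the rewritten inner nodes on that
path still point at all the untouched older nodes earlier in the file. (On
open, CouchDB does scan backwards from the end of the file, but only to find
the last intact header, not individual documents.) Here's a sketch of the
lookup; the names (read_node, root_offset) are invented for illustration, not
the real couch_btree/couch_file API:

# Sketch of a lookup in an append-only B+ tree file. Layout and names are made
# up; the real logic lives in couch_btree / couch_file.
def lookup(read_node, header, key):
    """Walk from the root referenced by the latest header down to a leaf.

    read_node(offset) is assumed to return either
      ("inner", [(last_key_in_subtree, child_offset), ...]) or
      ("leaf",  [(key, doc_info), ...]).
    """
    offset = header["root_offset"]        # latest root; older roots are ignored
    while True:
        kind, entries = read_node(offset)
        if kind == "leaf":
            for k, doc_info in entries:
                if k == key:
                    return doc_info       # holds the offset of the doc body
            return None
        # Inner node: follow the first child whose key range covers `key`;
        # untouched subtrees still point at nodes written long ago in the file.
        for last_key, child_offset in entries:
            if key <= last_key:
                offset = child_offset
                break
        else:
            return None                   # key is beyond the largest key stored

if __name__ == "__main__":
    # Tiny in-memory "file": offsets are just dict keys.
    nodes = {
        0: ("leaf", [(b"a", "doc_info(a)"), (b"m", "doc_info(m)")]),
        1: ("leaf", [(b"r", "doc_info(r)"), (b"z", "doc_info(z)")]),
        2: ("inner", [(b"m", 0), (b"z", 1)]),
    }
    print(lookup(nodes.get, {"root_offset": 2}, b"r"))

So a doc that was never touched is still reachable through the newest root,
and a doc that was just updated is reachable the same way; nothing needs to be
patched in memory, and the stale nodes simply become garbage until compaction.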
>>
>> Best wishes,
>> R
>>
>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne
>> <dionne@dionne-associates.com> wrote:
>>
>>> Robert Dionne
>>> Computer Programmer
>>> dionne@dionne-associates.com
>>> 203.231.9961
>>>
>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
>>>
>>>> Filipe,
>>>>
>>>> Thank you for the reply.
>>>>
>>>> Maybe I am misunderstanding exactly what Couch is writing out; the docs
>>>> I've read say that it "rewrites the root node" -- I can't tell if the
>>>> docs mean the parent node of the child doc that was changed (as one of
>>>> the B+ leaves), or if it means the direct path, from the root, to that
>>>> individual doc... or if it means the *entire* index...
>>>>
>>>> In the case of even rewriting the single parent, with such a shallow
>>>> tree, each internal node will have a huge fan of children; let's say
>>>> 1-10k in a decent-sized data set.
>>>>
>>>> If you are seeing only a few K of extra data written out after each
>>>> changed doc, then that cannot be right... I almost assumed my
>>>> understanding was wrong because the sheer volume of data would make
>>>> performance abysmal if it were true.
>>>>
>>>> Given that... is it just the changed path, from the root to the new
>>>> leaf, that is rewritten?
>>>
>>> Hi Riyad,
>>>
>>> You are correct, it's only the changed path. Interestingly, I've just
>>> started to document all these internals[1], along with links to the code
>>> and other references available.
>>>
>>> Cheers,
>>>
>>> Bob
>>>
>>> [1] http://bdionne.github.com/couchdb/
>>>
>>>> That makes me all sorts of curious as to how Couch updates/searches the
>>>> newly modified index with the small diff that is written out.
>>>>
>>>> Any pointers to reading that will help me dig down on this (even early
>>>> bugs in JIRA?) would be appreciated. I've tried skimming back to 2007/08
>>>> on Damien's blog to see if he wrote about it in depth, and so far
>>>> haven't found anything as detailed as I am hoping for on this
>>>> architecture.
>>>>
>>>> Best,
>>>> Riyad
>>>>
>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana
>>>> <fdmanana@apache.org> wrote:
>>>>
>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla wrote:
>>>>>> I've been reading everything I can find on the CouchDB file format[1]
>>>>>> and am getting bits and pieces here and there, but not a great,
>>>>>> concrete, step-by-step explanation of the process.
>>>>>>
>>>>>> I'm clear on the use of B+ trees, and after reading a few papers on
>>>>>> the benefits of log-structured file formats, I understand the benefits
>>>>>> of inlining the B+ tree indices directly into the data file as well
>>>>>> (locality + sequential I/O)... what I'm flummoxed about is how much of
>>>>>> the B+ tree's index is rewritten after every modified document.
>>>>>>
>>>>>> Consider a CouchDB file that looks more or less like this:
>>>>>>
>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
>>>>>>
>>>>>> After each revised doc is written and the "b-tree root" is rewritten
>>>>>> after that, is that just a modified root node of the B+ tree or the
>>>>>> entire B+ tree?
>>>>>>
>>>>>> The reason I ask is because, regardless of the answer to my previous
>>>>>> question, for a *huge* database with millions of records, that seems
>>>>>> like an enormous amount of data to rewrite after every modification.
>>>>>> Say the root node had a fanning factor of 133; that would still be a
>>>>>> lot of data to rewrite.
>>>>>
>>>>> Hi Riyad,
>>>>>
>>>>> Have you observed that in practice?
>>>>>
>>>>> Typically the depth of database btrees is not that high, even for
>>>>> millions of documents. For example, I have one around with about 10
>>>>> million documents which doesn't have more than 5 or 6 levels, if I
>>>>> recall correctly.
>>>>>
>>>>> So updating a doc, for that particular case, means rewriting 5 or 6
>>>>> new nodes plus the document itself. Each node is normally not much
>>>>> bigger than 1.2Kb.
>>>>>
>>>>> I once wrote a tool to analyze database files which reports btree
>>>>> depths; however, it hasn't been updated to work with recent changes on
>>>>> master/1.2.x such as snappy compression and btree sizes:
>>>>>
>>>>> https://github.com/fdmanana/couchfoo
>>>>>
>>>>> It should work with CouchDB 1.1 (and older) database files.
>>>>>
>>>>>> I am certain I am missing the boat on this; if anyone can pull me out
>>>>>> of the water and point me to dry land I'd appreciate it.
>>>>>>
>>>>>> Best,
>>>>>> R
>>>>>>
>>>>>> [1]
>>>>>> -- http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
>>>>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
>>>>>> -- http://ayende.com/blog/* (over my head)
>>>>>
>>>>> --
>>>>> Filipe David Manana,
>>>>>
>>>>> "Reasonable men adapt themselves to the world.
>>>>> Unreasonable men adapt the world to themselves.
>>>>> That's why all progress depends on unreasonable men."
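Putting Filipe's figures next to Riyad's single-insert test: with a 5-6 level
tree and roughly 1.2 KB per rewritten node, each individually committed write
appends several KB of index on top of the document itself, while a bulk write
with sequential ids can, in the best case, share a single path rewrite across
the whole batch. A quick sanity check (plain arithmetic, illustrative numbers
only, not measured data):

# Why sequential ids + bulk updates matter: compare index bytes appended by N
# individual writes vs one bulk commit whose sequential ids all land near the
# same leaf. Rough best-case model, illustrative numbers only.
N = 1000               # documents written
depth = 6              # tree levels rewritten per commit
node_kb = 1.2          # approximate size of one rewritten node

single_writes = N * depth * node_kb      # each write rewrites a full path
bulk_sequential = depth * node_kb        # one commit, one shared path (best case)

print(f"{N} single writes : ~{single_writes:,.0f} KB of index appended")
print(f"1 bulk write      : ~{bulk_sequential:,.0f} KB of index appended (best case)")

Scaled up to a million single-document inserts, that is on the order of several
GB of stale index nodes sitting in the file until compaction reclaims them,
which lines up with the "explode in size until compacted" behavior described
earlier in the thread.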