Subject: Re: Understanding the CouchDB file format
From: Robert Dionne <dionne@dionne-associates.com>
Date: Wed, 21 Dec 2011 14:53:05 -0500
To: dev@couchdb.apache.org

I think this is largely correct, Riyad. I dug out an old article[1] by Ricky Ho
that you may also find helpful, though it might be slightly dated. Generally the
best performance will be had if the ids are sequential and updates are done in
bulk. Write-heavy applications will eat up a lot of space and require compaction.
At the leaf nodes, what is stored are either full_doc_info or doc_info records,
which hold pointers to the data. So the main things that impact the branching at
each level are the key size and, in the case of views, the sizes of the
reductions, as these are stored with the intermediate nodes.

All in all it works pretty well, but as always you need to test and evaluate it
for your specific case to see what the limits are.
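To make that a bit more concrete, here is a rough sketch of why key size drives
the branching factor. It's Python rather than the actual Erlang, and the field
names and the 40-byte overhead figure are purely illustrative, not the real
full_doc_info/doc_info record layout:

# Illustrative only: roughly what a by-id leaf entry carries, and how key size
# limits how many entries fit in a node of a given size budget.
from dataclasses import dataclass

@dataclass
class LeafEntry:                 # loosely analogous to a doc_info record
    key: bytes                   # the document _id
    update_seq: int              # sequence number of the last update
    rev: str                     # winning revision
    body_offset: int             # file offset where the doc body lives
    deleted: bool = False

FIXED_OVERHEAD = 40              # assumed bytes per entry for seq/rev/offset/flags

def entries_per_node(node_budget_bytes, avg_key_bytes):
    """How many entries fit in one node, under the assumed size budget."""
    return node_budget_bytes // (avg_key_bytes + FIXED_OVERHEAD)

if __name__ == "__main__":
    for key_size in (16, 64, 256):
        print(key_size, "byte keys ->", entries_per_node(1200, key_size),
              "entries per ~1.2 KB node")

Small sequential ids give you wide nodes and a shallower tree; large keys (or
large view reductions, which live in the inner nodes) shrink the fan-out.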
Regards,

Bob

[1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html

On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:

> Adding to this conversation, I found this set of slides by Chris explaining
> the append-only index update format:
> http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
>
> Specifically slides 16, 17 and 18.
>
> Using this example tree, rewriting the updated path (in reverse order)
> appended to the end of the file makes sense... you can see how index queries
> can simply read backwards from the end of the file and not only find the
> latest revisions of docs, but also every other doc that wasn't touched (it
> will just seek into the existing inner nodes of the B+ tree for searching).
>
> What I am hoping for clarification on is the following pain point that I
> perceive with this approach:
>
> 1. In a sufficiently shallow B+ tree (like CouchDB's), the paths to elements
> are short (typically no more than 3 to 5 levels deep), as opposed to a trie
> or some other construct that would have much longer paths to elements.
>
> 2. Because the depth of the tree is so shallow, its breadth becomes large to
> compensate... more specifically, each internal node can have 100s, 1000s or
> more children. Using the example slides, consider the nodes [A...M] and
> [R...Z] -- in a good-sized CouchDB database, those internal index nodes
> would have 100s (or more) elements in them pointing at deeper internal
> nodes that themselves had thousands of elements, instead of the 13 or so
> implied by [A...M].
>
> 3. Looking at slides 17 and 18, where you see the direct B+ tree path to the
> updated node getting appended to the end of the file after the revision is
> written (leaf-to-root ordering: [J' M] -> [A M] -> [A Z]), it implies that
> those internal nodes, with *all* their child elements, are getting
> rewritten as well.
>
> In this example tree it isn't such a big issue... but in a sufficiently
> large CouchDB database, the nodes denoted by [A...M] and [A...Z] could be
> quite large... I don't know the format of the node elements in the B+ tree,
> but it would be whatever the size of a node is times however many elements
> are contained at each level (1 for root, say 100 for level 2, 1,000 for
> level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on
> here; of course it depends on the size of the data store).
>
> Am I missing something, or is CouchDB really rewriting that much index
> information between document revisions on every update?
>
> What was previously confusing me is that I thought it was *only* rewriting a
> direct path to the updated revision, like [B]>[E]>[J'], and Couch was
> somehow patching that updated path info into the B+ index at runtime.
>
> If Couch is rewriting entire node paths with all their elements, then I am
> no longer confused about the B+ index updates, but am curious about the
> on-disk cost of this.
>
> In my own rough insertion testing, that would explain why I see my
> collections absolutely explode in size until they are compacted (not using
> bulk insert, but intentionally doing single inserts for millions of docs to
> see what kind of cost the index path duplication would be like).
>
> Can anyone confirm/deny/correct this assessment? I want to make sure I am
> on the right track understanding this.
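Your reading of the slides matches my understanding: each node on the changed
path is appended again in full, child pointers and all; what is not rewritten
is any subtree the path doesn't touch. A rough back-of-the-envelope sketch of
what one single-document update appends (Python, and all of the sizes are
assumptions for illustration, not measured CouchDB internals):

# Rough cost model for one single-document update in an append-only B+ tree.
# Every node on the root-to-leaf path is rewritten whole and appended, so the
# cost is roughly depth * node_size plus the new revision itself.
def bytes_appended_per_update(doc_bytes, depth, node_bytes):
    return doc_bytes + depth * node_bytes

if __name__ == "__main__":
    # Illustrative figures: ~1 KB doc, 4-6 levels, ~1.2 KB per rewritten node.
    for depth in (4, 5, 6):
        total = bytes_appended_per_update(1024, depth, 1200)
        print(f"depth {depth}: ~{total} bytes appended per update")

Which is why single-document writes inflate the file so quickly, and why bulk
updates (which share one path rewrite across many docs in a single commit) are
so much cheaper, until compaction rewrites everything once.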
>
> Best wishes,
> Riyad
>
> On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla wrote:
>
>> @Filipe - I was just not clear on how CouchDB operated; you and Robert
>> cleared that up for me. Thank you.
>>
>> @Robert - The writeup is excellent so far (I am not familiar with Erlang,
>> so there is a bit of stickiness there); thank you for taking the time to
>> put this together!
>>
>> At this point I am curious how the _id and _seq indices are read, as their
>> data is continually appended to the end of the data file in small
>> diff-trees for every updated doc.
>>
>> If CouchDB kept all the indices in memory and simply patched in the
>> updated paths at runtime (maybe something akin to memory-mapped indices in
>> MongoDB), I would be fairly clear on the operation... but as I understand
>> it, CouchDB keeps such a small memory footprint by doing no in-memory
>> caching and relying on the intelligence of the OS and filesystem (and/or
>> drives) to cache frequently accessed data.
>>
>> I am trying to understand the logic used by CouchDB to answer a query
>> using the index once updates to the tree have been appended to the data
>> file... for example, consider a CouchDB datastore like the one Filipe
>> has... 10 million documents, and let's say it is freshly compacted.
>>
>> If I send in a request to that Couch instance, it hits the header of the
>> data file along with the index and walks the B+ tree to the leaf node,
>> where it finds the offset into the data file where the actual doc lives...
>> let's say 1,000,000 bytes away.
>>
>> These B+ trees are shallow, so it might look something like this:
>>
>> Level 1: 1 node, the root node.
>> Level 2: 100 nodes, inner child nodes.
>> Level 3: 10,000 nodes, inner child nodes.
>> Level 4: 1,000,000 leaf nodes... all with pointers to the data offsets in
>> the data file.
>>
>> Now let's say I write 10 updates to documents in that file. There are 10
>> new revisions appended to the end of the data file, *each one* followed by
>> a rewritten B+ path to a leaf node with its new location at the end of the
>> file. Each of the paths written between the doc revisions (say roughly 2k
>> like Filipe mentioned) is just a 4-item path... root (level 1) -> level 2
>> -> level 3 -> level 4... showing the discrete path from the root to that
>> individual updated doc. The intermediate levels (1, 2, 3) are not fully
>> fleshed out with all the OTHER children from the original B+ tree index.
>>
>> [[ Is this correct so far? If not, please point out my mistakes... ]]
>>
>> Now I issue a query for a document that WAS NOT updated...
>>
>> **** this is where I get confused on the logic ****
>>
>> This would mean I need to access the original B+ tree index at the root of
>> the data file, because the revised B+ paths that are written between each
>> of the updated doc revisions at the end of the file are not full indices.
>>
>> NOW consider I want to query for one of the changed docs... now I suddenly
>> need to scan backwards from the data file's end to find the updated path to
>> the new revision of that document.
>>
>> (Obviously) this isn't what Couch is actually doing... it's doing
>> something more elegant, I just can't figure out what or how, and that is
>> what I was hoping for help with.
>>
>> Much thanks guys, I know this is a heavy question to ask.
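The missing piece here is the header: every committed update ends with a new
root node and then a new header that records that root's file offset. A query
never scans backwards through the appended paths; it starts from the root
referenced by the most recent header, and the rewritten inner nodes on that
path still point at all the untouched older nodes earlier in the file. (On
open, CouchDB does scan backwards from the end of the file, but only to find
the last intact header, not individual documents.) Here's a sketch of the
lookup; the names (read_node, root_offset) are invented for illustration, not
the real couch_btree/couch_file API:

# Sketch of a lookup in an append-only B+ tree file. Layout and names are made
# up; the real logic lives in couch_btree / couch_file.
def lookup(read_node, header, key):
    """Walk from the root referenced by the latest header down to a leaf.

    read_node(offset) is assumed to return either
      ("inner", [(last_key_in_subtree, child_offset), ...]) or
      ("leaf",  [(key, doc_info), ...]).
    """
    offset = header["root_offset"]        # latest root; older roots are ignored
    while True:
        kind, entries = read_node(offset)
        if kind == "leaf":
            for k, doc_info in entries:
                if k == key:
                    return doc_info       # holds the offset of the doc body
            return None
        # Inner node: follow the first child whose key range covers `key`;
        # untouched subtrees still point at nodes written long ago in the file.
        for last_key, child_offset in entries:
            if key <= last_key:
                offset = child_offset
                break
        else:
            return None                   # key is beyond the largest key stored

if __name__ == "__main__":
    # Tiny in-memory "file": offsets are just dict keys.
    nodes = {
        0: ("leaf", [(b"a", "doc_info(a)"), (b"m", "doc_info(m)")]),
        1: ("leaf", [(b"r", "doc_info(r)"), (b"z", "doc_info(z)")]),
        2: ("inner", [(b"m", 0), (b"z", 1)]),
    }
    print(lookup(nodes.get, {"root_offset": 2}, b"r"))

So a doc that was never touched is still reachable through the newest root,
and a doc that was just updated is reachable the same way; nothing needs to be
patched in memory, and the stale nodes simply become garbage until compaction.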
>>
>> Best wishes,
>> R
>>
>> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne
>> <dionne@dionne-associates.com> wrote:
>>
>>> Robert Dionne
>>> Computer Programmer
>>> dionne@dionne-associates.com
>>> 203.231.9961
>>>
>>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
>>>
>>>> Filipe,
>>>>
>>>> Thank you for the reply.
>>>>
>>>> Maybe I am misunderstanding exactly what Couch is writing out; the docs
>>>> I've read say that it "rewrites the root node" -- I can't tell if the
>>>> docs mean the parent node of the child doc that was changed (as one of
>>>> the B+ leaves), or if it means the direct path, from the root, to that
>>>> individual doc... or if it means the *entire* index...
>>>>
>>>> In the case of even rewriting the single parent, with such a shallow
>>>> tree, each internal node will have a huge fan of children; let's say
>>>> 1-10k in a decent-sized data set.
>>>>
>>>> If you are seeing only a few K of extra data written out after each
>>>> changed doc, then that cannot be right... I almost assumed my
>>>> understanding was wrong because the sheer volume of data would make
>>>> performance abysmal if it were true.
>>>>
>>>> Given that... is it just the changed path, from the root to the new
>>>> leaf, that is rewritten?
>>>
>>> Hi Riyad,
>>>
>>> You are correct, it's only the changed path. Interestingly, I've just
>>> started to document all these internals[1], along with links to the code
>>> and other references available.
>>>
>>> Cheers,
>>>
>>> Bob
>>>
>>> [1] http://bdionne.github.com/couchdb/
>>>
>>>> That makes me all sorts of curious as to how Couch updates/searches the
>>>> newly modified index with the small diff that is written out.
>>>>
>>>> Any pointers to reading that will help me dig down on this (even early
>>>> bugs in JIRA?) would be appreciated. I've tried skimming back to 2007/08
>>>> on Damien's blog to see if he wrote about it in depth, and so far
>>>> haven't found anything as detailed as I am hoping for on this
>>>> architecture.
>>>>
>>>> Best,
>>>> Riyad
>>>>
>>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana
>>>> <fdmanana@apache.org> wrote:
>>>>
>>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla wrote:
>>>>>> I've been reading everything I can find on the CouchDB file format[1]
>>>>>> and am getting bits and pieces here and there, but not a great,
>>>>>> concrete, step-by-step explanation of the process.
>>>>>>
>>>>>> I'm clear on the use of B+ trees, and after reading a few papers on
>>>>>> the benefits of log-structured file formats, I understand the benefits
>>>>>> of inlining the B+ tree indices directly into the data file as well
>>>>>> (locality + sequential I/O)... what I'm flummoxed about is how much of
>>>>>> the B+ tree's index is rewritten after every modified document.
>>>>>>
>>>>>> Consider a CouchDB file that looks more or less like this:
>>>>>>
>>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
>>>>>>
>>>>>> After each revised doc is written and the "b-tree root" is rewritten
>>>>>> after that, is that just a modified root node of the B+ tree or the
>>>>>> entire B+ tree?
>>>>>>
>>>>>> The reason I ask is because, regardless of the answer to my previous
>>>>>> question, for a *huge* database with millions of records, that seems
>>>>>> like an enormous amount of data to rewrite after every modification.
>>>>>> Say the root node had a fanning factor of 133; that would still be a
>>>>>> lot of data to rewrite.
>>>>>
>>>>> Hi Riyad,
>>>>>
>>>>> Have you observed that in practice?
>>>>>
>>>>> Typically the depth of database btrees is not that high, even for
>>>>> millions of documents. For example, I have one around with about 10
>>>>> million documents which doesn't have more than 5 or 6 levels, if I
>>>>> recall correctly.
>>>>>
>>>>> So updating a doc, for that particular case, means rewriting 5 or 6
>>>>> new nodes plus the document itself. Each node is normally not much
>>>>> bigger than 1.2Kb.
>>>>>
>>>>> I once wrote a tool to analyze database files which reports btree
>>>>> depths; however, it hasn't been updated to work with recent changes on
>>>>> master/1.2.x such as snappy compression and btree sizes:
>>>>>
>>>>> https://github.com/fdmanana/couchfoo
>>>>>
>>>>> It should work with CouchDB 1.1 (and older) database files.
>>>>>
>>>>>> I am certain I am missing the boat on this; if anyone can pull me out
>>>>>> of the water and point me to dry land I'd appreciate it.
>>>>>>
>>>>>> Best,
>>>>>> R
>>>>>>
>>>>>> [1]
>>>>>> -- http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
>>>>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
>>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
>>>>>> -- http://ayende.com/blog/* (over my head)
>>>>>
>>>>> --
>>>>> Filipe David Manana,
>>>>>
>>>>> "Reasonable men adapt themselves to the world.
>>>>> Unreasonable men adapt the world to themselves.
>>>>> That's why all progress depends on unreasonable men."
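Putting Filipe's figures next to Riyad's single-insert test: with a 5-6 level
tree and roughly 1.2 KB per rewritten node, each individually committed write
appends several KB of index on top of the document itself, while a bulk write
with sequential ids can, in the best case, share a single path rewrite across
the whole batch. A quick sanity check (plain arithmetic, illustrative numbers
only, not measured data):

# Why sequential ids + bulk updates matter: compare index bytes appended by N
# individual writes vs one bulk commit whose sequential ids all land near the
# same leaf. Rough best-case model, illustrative numbers only.
N = 1000               # documents written
depth = 6              # tree levels rewritten per commit
node_kb = 1.2          # approximate size of one rewritten node

single_writes = N * depth * node_kb      # each write rewrites a full path
bulk_sequential = depth * node_kb        # one commit, one shared path (best case)

print(f"{N} single writes : ~{single_writes:,.0f} KB of index appended")
print(f"1 bulk write      : ~{bulk_sequential:,.0f} KB of index appended (best case)")

Scaled up to a million single-document inserts, that is on the order of several
GB of stale index nodes sitting in the file until compaction reclaims them,
which lines up with the "explode in size until compacted" behavior described
earlier in the thread.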