couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Riyad Kalla <rka...@gmail.com>
Subject Re: Understanding the CouchDB file format
Date Wed, 21 Dec 2011 20:28:02 GMT
Bob,

Really appreciate the link; Rick has a handful of articles that helped a
lot.

Along side all the CouchDB reading I've been looking at SSD-optimized data
storage mechanisms and tried to coalesce all of this information into this
post on Couch's file storage format:
https://plus.google.com/u/0/107397941677313236670/posts/CyvwRcvh4vv

It is uncanny how many things Couch seems to have gotten right with regard
to existing storage systems and future flash-based storage systems. I'd
appreciate any corrections, additions or feedback to the post for anyone
interested.

Best,
R

On Wed, Dec 21, 2011 at 12:53 PM, Robert Dionne <
dionne@dionne-associates.com> wrote:

> I think this is largely correct Riyad, I dug out an old article[1] by Rick
> Ho that you may also find helpful though it might be slightly dated.
> Generally the best performance will be had if the ids are sequential and
> updates are done in bulk. Write heavy applications will eat up a lot of
> space and require compaction. At the leaf nodes what are stored are either
> full_doc_info records or doc_info records which store pointers to the data
> so the main thing that impacts the branching at each level are the key size
> and in the case of views the sizes of the reductions as these are stored
> with the intermediate nodes.
>
> All in all it works pretty well but as always you need to test and
> evaluate it for you specific case to see what the limits are.
>
> Regards,
>
> Bob
>
>
> [1] http://horicky.blogspot.com/2008/10/couchdb-implementation.html
>
>
>
>
> On Dec 21, 2011, at 2:17 PM, Riyad Kalla wrote:
>
> > Adding to this conversation, I found this set of slides by Chris
> explaining
> > the append-only index update format:
> > http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed
> >
> > Specifically slides 16, 17 and 18.
> >
> > Using this example tree, rewriting the updated path (in reverse order)
> > appended to the end of the file makes sense... you see how index queries
> > can simply read backwards from the end of the file and not only find the
> > latest revisions of docs, but also every other doc that wasn't touched
> (it
> > will just seek into the existing inner nodes of the b+ tree for
> searching).
> >
> > What I am hoping for clarification on is the following pain-point that I
> > perceive with this approach:
> >
> > 1. In a sufficiently shallow B+ tree (like CouchDB), the paths themselves
> > to elements are short (typically no more than 3 to 5 levels deep) as
> > opposed to a trie or some other construct that would have much longer
> paths
> > to elements.
> >
> > 2. Because the depth of the tree is so shallow, the breadth of it becomes
> > large to compensate... more specifically, each internal node can have
> 100s,
> > 1000s or more children. Using the example slides, consider the nodes
> > [A...M] and [R...Z] -- in a good sized CouchDB database, those internal
> > index nodes would have 100s (or more) elements in them pointing at deeper
> > internal nodes that themselves had thousands of elements; instead of the
> 13
> > or so as implied by [A...M].
> >
> > 3. Looking at slide 17 and 18, where you see the direct B+ tree path to
> the
> > update node getting appended to the end of the file after the revision is
> > written (leaf to root ordering: [J' M] -> [A M] -> [A Z]) it implies that
> > those internal nodes with *all* their child elements are getting
> rewritten
> > as well.
> >
> > In this example tree, it is isn't such a big issue... but in a
> sufficiently
> > large CouchDB database, these nodes denoted by [A...M] and [A...Z] could
> be
> > quite large... I don't know the format of the node elements in the B+
> tree,
> > but it would be whatever the size of a node is times however many
> elements
> > are contained at each level (1 for root, say 100 for level 2, 1000 for
> > level 3 and 10,000 for level 4 -- there is a lot of hand-waving going on
> > here, of course it depends on the size of the data store).
> >
> > Am I missing something or is CouchDB really rewriting that much index
> > information between document revisions on every update?
> >
> > What was previously confusing me is I thought it was *only* rewriting a
> > direct path to the updated revision, like [B]>[E]>[J'] and Couch was
> > some-how patching in that updated path info to the B+ index at runtime.
> >
> > If couch is rewriting entire node paths with all their elements then I am
> > no longer confused about the B+ index updates, but am curious about the
> > on-disk cost of this.
> >
> > In my own rough insertion testing, that would explain why I see my
> > collections absolutely explode in size until they are compacted (not
> using
> > bulk insert, but intentionally doing single inserts for a million(s) of
> > docs to see what kind of cost the index path duplication would be like).
> >
> > Can anyone confirm/deny/correct this assessment? I want to make sure I am
> > on the right track understanding this.
> >
> > Best wishes,
> > Riyad
> >
> > On Tue, Dec 20, 2011 at 6:13 PM, Riyad Kalla <rkalla@gmail.com> wrote:
> >
> >> @Filipe - I was just not clear on how CouchDB operated; you and Robert
> >> cleared that up for me. Thank you.
> >>
> >> @Robert - The writeup is excellent so far (I am not familiar with
> erlang,
> >> so there is a bit of stickiness there), thank you for taking the time to
> >> put this together!
> >>
> >> At this point I am curious how the _id and _seq indices are read as
> their
> >> data is continually appended to the end of the data file in small
> >> diff-trees for every updated doc.
> >>
> >> If CouchDB kept all the indices in-memory and simply patched-in the
> >> updated paths at runtime (maybe something akin to memory-mapped indices
> in
> >> MongoDB) I would be fairly clear on the operation... but as I understand
> >> it, CouchDB keeps such a small memory footprint by doing no in-memory
> >> caching and relying on the intelligence of the OS and filesystem (and/or
> >> drives) to cache frequently accessed data.
> >>
> >> I am trying to understand the logic used by CouchDB to answer a query
> >> using the index once updates to the tree have been appended to the data
> >> file... for example, consider a CouchDB datastore like the one Filipe
> >> has... 10 million documents and let's say it is freshly compacted.
> >>
> >> If I send in a request to that Couch instance, it hits the header of the
> >> data file along with the index and walks the B+ tree to the leaf node,
> >> where it finds the offset into the data file where the actual doc
> lives...
> >> let's say 1,000,000 bytes away.
> >>
> >> These B+ trees are shallow, so it might look something like this:
> >>
> >> Level 1: 1 node, root node.
> >> Level 2: 100 nodes, inner child nodes
> >> Level 3: 10,000 nodes, inner child nodes
> >> Level 4: 1,000,000, leaf nodes... all with pointers to the data offsets
> in
> >> the data file.
> >>
> >> Now let's say I write 10 updates to documents in that file. There are 10
> >> new revisions appended to the end of the data file *each one* separated
> by
> >> a rewritten B+ path to a leaf node with it's new location at the end of
> the
> >> file. Each of those paths written between each doc revision (say
> roughly 2k
> >> like Filipe mentioned) are just 4 item paths... root -> level1 ->
> level2 ->
> >> level3 --> level4... showing the discrete path from the root to that
> >> individual updated doc. The intermediary levels (l1, 2, 3) are not fully
> >> flushed out with all the OTHER children from the original b+ tree index.
> >>
> >> [[ is this correct so far? If not, please point out my mistakes...]
> >>
> >> Now I issue a query for a document that WAS NOT updated...
> >>
> >> **** this is where I get confused on the logic ***
> >>
> >> this would mean I need to access the original B+ tree index at the root
> of
> >> the data file, because the revised B+ paths that are written between
> each
> >> of the updated doc revisions at the end of the file are not full
> indices.
> >>
> >> NOW consider I want to query for one of the changed docs... now I
> suddenly
> >> need to scan backwards from the data file's end to find the updated
> path to
> >> the new revision of that document.
> >>
> >> (obviously) this isn't what Couch is actually doing... it's doing
> >> something more elegant, I just can't figure out what or how and that is
> >> what I was hoping for help with.
> >>
> >> Much thanks guys, I know this is a heavy question to ask.
> >>
> >> Best wishes,
> >> R
> >>
> >>
> >> On Tue, Dec 20, 2011 at 1:35 PM, Robert Dionne <
> >> dionne@dionne-associates.com> wrote:
> >>
> >>>
> >>> Robert Dionne
> >>> Computer Programmer
> >>> dionne@dionne-associates.com
> >>> 203.231.9961
> >>>
> >>>
> >>>
> >>>
> >>> On Dec 20, 2011, at 3:27 PM, Riyad Kalla wrote:
> >>>
> >>>> Filipe,
> >>>>
> >>>> Thank you for the reply.
> >>>>
> >>>> Maybe I am misunderstanding exactly what couch is writing out; the
> docs
> >>>> I've read say that it "rewrites the root node" -- I can't tell if the
> >>> docs
> >>>> mean the parent node of the child doc that was changed (as one of the
> b+
> >>>> leaves) or if it means the direct path, from the root, to that
> >>> individual
> >>>> doc... or if it means the *entire* index...
> >>>>
> >>>> In the case of even rewriting the single parent, with such a shallow
> >>> tree,
> >>>> each internal leaf will have a huge fan of nodes; let's say 1-10k in
a
> >>>> decent sized data set.
> >>>>
> >>>> If you are seeing a few K of extra written out after each changed doc
> >>> then
> >>>> that cannot be write... I almost assumed my understanding was wrong
> >>> because
> >>>> the sheer volume of data would make performance abysmal if it was
> true.
> >>>>
> >>>> Given that... is it just the changed path, from the root to the new
> leaf
> >>>> that is rewritten?
> >>>
> >>> Hi Riyad,
> >>>
> >>> You are correct, it's only the changed path. Interestingly I've just
> >>> started to document all these internals[1] along with links to the
> code and
> >>> other references available.
> >>>
> >>> Cheers,
> >>>
> >>> Bob
> >>>
> >>>
> >>> [1] http://bdionne.github.com/couchdb/
> >>>
> >>>> That makes me all sorts of curious as to how Couch
> >>>> updates/searches the new modified index with the small diff that is
> >>> written
> >>>> out.
> >>>>
> >>>> Any pointers to reading that will help me dig down on this (even early
> >>> bugs
> >>>> in JIRA?) would be appreciated. I've tried skimming back in 2007/08
on
> >>>> Damien's blog to see if it wrote about it in depth and so far haven't
> >>> found
> >>>> anything as detailed as I am hoping for on this architecture.
> >>>>
> >>>> Best,
> >>>> Riyad
> >>>>
> >>>> On Tue, Dec 20, 2011 at 1:07 PM, Filipe David Manana <
> >>> fdmanana@apache.org>wrote:
> >>>>
> >>>>> On Tue, Dec 20, 2011 at 6:24 PM, Riyad Kalla <rkalla@gmail.com>
> wrote:
> >>>>>> I've been reading everything I can find on the CouchDB file
> format[1]
> >>> and
> >>>>>> am getting bits and pieces here and there, but not a great,
> concrete,
> >>>>>> step-by-step explanation of the process.
> >>>>>>
> >>>>>> I'm clear on the use of B+ trees and after reading a few papers
on
> the
> >>>>>> benefits of log-structured file formats, I understand the benefits
> of
> >>>>>> inlining the B+ tree indices directly into the data file as
well
> >>>>> (locality
> >>>>>> + sequential I/O)... what I'm flummoxed about is how much of
the B+
> >>>>> tree's
> >>>>>> index is rewritten after every modified document.
> >>>>>>
> >>>>>> Consider a CouchDB file that looks more or less like this:
> >>>>>>
> >>>>>> [idx/header][doc1, rev1][idx/header][doc1, rev2]....
> >>>>>>
> >>>>>> After each revised doc is written and the "b-tree root" is rewritten
> >>>>> after
> >>>>>> that, is that just a modified root node of the B+ tree or the
entire
> >>> B+
> >>>>>> tree?
> >>>>>>
> >>>>>> The reason I ask is because regardless of the answer to my previous
> >>>>>> question, for a *huge* database will millions of records, that
seems
> >>> like
> >>>>>> an enormous amount of data to rewrite after every modification.
Say
> >>> the
> >>>>>> root node had a fanning factor of 133; that would still be alot
of
> >>> data
> >>>>> to
> >>>>>> rewrite.
> >>>>>
> >>>>> Hi Riyad,
> >>>>>
> >>>>> Have you observed that in practice?
> >>>>>
> >>>>> Typically the depth of database btrees is not that high even for
> >>>>> millions of documents. For example I have one around with about
10
> >>>>> million documents which doesn't have more than 5 or 6 levels if
I
> >>>>> recall correctly.
> >>>>>
> >>>>> So updating a doc, for that particular case, means rewriting 5 or
6
> >>>>> new nodes plus the document itself. Each node is normally not much
> >>>>> bigger than 1.2Kb.
> >>>>>
> >>>>> I've written once a tool to analyze database files which reports
> btree
> >>>>> depths, however it's not updated to work with recent changes on
> >>>>> master/1.2.x such as snappy compression and btree sizes:
> >>>>>
> >>>>> https://github.com/fdmanana/couchfoo
> >>>>>
> >>>>> It should work with CouchDB 1.1 (and older) database files.
> >>>>>
> >>>>>>
> >>>>>> I am certain I am missing the boat on this; if anyone can pull
me
> out
> >>> of
> >>>>>> the water and point me to dry land I'd appreciate it.
> >>>>>>
> >>>>>> Best,
> >>>>>> R
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> [1]
> >>>>>> --
> >>>>>>
> >>>>>
> >>>
> http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22CouchDB-Implements-a-Fundamental-Algorithm%22%5D
> >>>>>> -- http://horicky.blogspot.com/2008/10/couchdb-implementation.html
> >>>>>> -- http://blog.kodekabuki.com/post/132952897/couchdb-naked
> >>>>>> -- http://guide.couchdb.org/editions/1/en/btree.html
> >>>>>> -- http://ayende.com/blog/* (Over my head)
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Filipe David Manana,
> >>>>>
> >>>>> "Reasonable men adapt themselves to the world.
> >>>>> Unreasonable men adapt the world to themselves.
> >>>>> That's why all progress depends on unreasonable men."
> >>>>>
> >>>
> >>>
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message