couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Riyad Kalla <rka...@gmail.com>
Subject Re: Understanding the CouchDB file format
Date Fri, 23 Dec 2011 02:53:16 GMT
Randall,

Spot on; we are on the same page now. I'll go through the post again
tomorrow morning and reword it a bit so the thoughts aren't so fragmented.

Best,
Riyad

P.S.> Are there discussions anywhere on alternative file formats for the
indices that the Couch community has considered in the past? (old JIRA or
mailing list discussions) or has this file format been the only approach
from the initial work started in 05/06?

The reason I ask is because of the index fragmentation you mentioned that
can occur over time... I'd be curious if an in-place-edited format for the
indices as separate file(s) from the append-only document revisions would
yield better performance over time. The splits would still be expensive,
but you'd be able to pre-allocate your blocks for each node on each split
to help a bit. Even leveraging an append-only, pre-allocated approach to
the index might work where split nodes are appended to the end of the index
file with the full node size (e.g. 133 elements) pre-allocated and the
index marked ready for compaction or something.

So it would be a hybrid approach... when splits aren't needed, you do an
in-place edit to the index on-disk. If a split is needed, you use a
technique similar to the one now where the effected node hierarchy is
appended to the end of the file.

You still run the risk of costly index rebuilds in the cast of a failed
in-place edit, but you wouldn't run the risk of any data loss... I am just
wondering for large data sets if this would yield some significant benefits
putting Couch somewhere between a Mongo and Couch (performance wise).

Just pie-in-the-sky thinking... I am sure the senior team here has talked
this stuff to death years ago. My apologies if this is re-treading road
covered in beaten horse bodies ;)

I just find this category of problems interesting.


On Thu, Dec 22, 2011 at 6:10 PM, Randall Leeds <randall.leeds@gmail.com>wrote:

> On Thu, Dec 22, 2011 at 12:11, Riyad Kalla <rkalla@gmail.com> wrote:
> > On Thu, Dec 22, 2011 at 12:38 PM, Robert Newson <rnewson@apache.org>
> wrote:
> >
> >> There are a
> >> few parts of the article that are inaccurate (like the assertion we
> >> have good locality for the id and seq trees. If this were true we
> >> wouldn't have seen such a huge improvement in compaction by
> >> temporarily separating them).
> >
> >
> > I'd look forward to more detail on this... it was my understanding the
> > updates were appended out in the [doc rev][_id idx diff][_seq idx diff]
> > format at the end of the data file. Sounds like I may have misunderstood
> > that?
> >
>
> Riyad, as you pointed out earlier, all the inner nodes are rewritten
> up to the root. The two btrees are not written in parallel, though,
> which means that for deep trees all the updated nodes are written
> before the other tree's nodes are written. Also remember that the
> trees themselves end up pretty fragmented since older nodes that
> haven't changed are back toward the beginning of the file. In general,
> I'm not sure there's much that's useful to mention about locality in
> the trees. Also, updating these trees requires reading the old values,
> so there is still seeking that occurs (if the pages aren't cached by
> the OS).
>
> >
> >> The 'COMPETE recreation' paragraph also
> >> strikes me as factually awry.
> >>
> >
> > I'd appreciate a detailed correction on this if it is wrong; all the
> > digging I've done (in this thread and other partial resources) suggests
> > that the path from the changed doc ref back to the root (including a copy
> > of all internal nodes and all of their child references) is written so as
> > being able to read-back into the index from the tail of the data file
> > quickly... specifically slides 17, 18 and 19 from this slidedeck (
> > http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed) -- note
> that
> > the interim nodes [A..M] and [A..Z] are rewritten (including any and all
> > child pointers they contained).
> >
> > This is what I was referring to; I should either clean up my wording
> > (confusing) or I got it wrong in which case I'd appreciate any and all
> > corrections.
>
> Right. It mostly seems a bit confusing to me.
> "it DOES NOT just rewrite the nodes pathing from the leaf to the node
> and ONLY the connections for that single document"
> That doesn't sound quite right, but I can tell what you're trying to
> say is accurate. If I'm right, you mean that every changed inner node
> is rewritten in its entirety rather than having a single pointer to
> the new child updated in place.
>
> Cheers. Thanks for taking the time to write this up.
>
> -Randall
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message