couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Davis <>
Subject Re: Understanding the CouchDB file format
Date Sun, 25 Dec 2011 08:42:15 GMT
On Thu, Dec 22, 2011 at 8:53 PM, Riyad Kalla <> wrote:
> Randall,
> Spot on; we are on the same page now. I'll go through the post again
> tomorrow morning and reword it a bit so the thoughts aren't so fragmented.
> Best,
> Riyad
> P.S.> Are there discussions anywhere on alternative file formats for the
> indices that the Couch community has considered in the past? (old JIRA or
> mailing list discussions) or has this file format been the only approach
> from the initial work started in 05/06?
> The reason I ask is because of the index fragmentation you mentioned that
> can occur over time... I'd be curious if an in-place-edited format for the
> indices as separate file(s) from the append-only document revisions would
> yield better performance over time. The splits would still be expensive,
> but you'd be able to pre-allocate your blocks for each node on each split
> to help a bit. Even leveraging an append-only, pre-allocated approach to
> the index might work where split nodes are appended to the end of the index
> file with the full node size (e.g. 133 elements) pre-allocated and the
> index marked ready for compaction or something.
> So it would be a hybrid approach... when splits aren't needed, you do an
> in-place edit to the index on-disk. If a split is needed, you use a
> technique similar to the one now where the effected node hierarchy is
> appended to the end of the file.
> You still run the risk of costly index rebuilds in the cast of a failed
> in-place edit, but you wouldn't run the risk of any data loss... I am just
> wondering for large data sets if this would yield some significant benefits
> putting Couch somewhere between a Mongo and Couch (performance wise).
> Just pie-in-the-sky thinking... I am sure the senior team here has talked
> this stuff to death years ago. My apologies if this is re-treading road
> covered in beaten horse bodies ;)
> I just find this category of problems interesting.

The goal of the view engine refactoring was to encourage people to
start asking and answering exactly these types of questions. I got a
bit distracted by work stuff since but I've still got a plan and some
notes on a blog post about how to write new indexers. Hopefully in
time we'll have a wider array of secondary index types than just the
m/r style we have currently.

> On Thu, Dec 22, 2011 at 6:10 PM, Randall Leeds <>wrote:
>> On Thu, Dec 22, 2011 at 12:11, Riyad Kalla <> wrote:
>> > On Thu, Dec 22, 2011 at 12:38 PM, Robert Newson <>
>> wrote:
>> >
>> >> There are a
>> >> few parts of the article that are inaccurate (like the assertion we
>> >> have good locality for the id and seq trees. If this were true we
>> >> wouldn't have seen such a huge improvement in compaction by
>> >> temporarily separating them).
>> >
>> >
>> > I'd look forward to more detail on this... it was my understanding the
>> > updates were appended out in the [doc rev][_id idx diff][_seq idx diff]
>> > format at the end of the data file. Sounds like I may have misunderstood
>> > that?
>> >
>> Riyad, as you pointed out earlier, all the inner nodes are rewritten
>> up to the root. The two btrees are not written in parallel, though,
>> which means that for deep trees all the updated nodes are written
>> before the other tree's nodes are written. Also remember that the
>> trees themselves end up pretty fragmented since older nodes that
>> haven't changed are back toward the beginning of the file. In general,
>> I'm not sure there's much that's useful to mention about locality in
>> the trees. Also, updating these trees requires reading the old values,
>> so there is still seeking that occurs (if the pages aren't cached by
>> the OS).
>> >
>> >> The 'COMPETE recreation' paragraph also
>> >> strikes me as factually awry.
>> >>
>> >
>> > I'd appreciate a detailed correction on this if it is wrong; all the
>> > digging I've done (in this thread and other partial resources) suggests
>> > that the path from the changed doc ref back to the root (including a copy
>> > of all internal nodes and all of their child references) is written so as
>> > being able to read-back into the index from the tail of the data file
>> > quickly... specifically slides 17, 18 and 19 from this slidedeck (
>> > -- note
>> that
>> > the interim nodes [A..M] and [A..Z] are rewritten (including any and all
>> > child pointers they contained).
>> >
>> > This is what I was referring to; I should either clean up my wording
>> > (confusing) or I got it wrong in which case I'd appreciate any and all
>> > corrections.
>> Right. It mostly seems a bit confusing to me.
>> "it DOES NOT just rewrite the nodes pathing from the leaf to the node
>> and ONLY the connections for that single document"
>> That doesn't sound quite right, but I can tell what you're trying to
>> say is accurate. If I'm right, you mean that every changed inner node
>> is rewritten in its entirety rather than having a single pointer to
>> the new child updated in place.
>> Cheers. Thanks for taking the time to write this up.
>> -Randall

View raw message