From dev-return-19832-apmail-couchdb-dev-archive=couchdb.apache.org@couchdb.apache.org Sun Dec 25 08:43:24 2011 Return-Path: X-Original-To: apmail-couchdb-dev-archive@www.apache.org Delivered-To: apmail-couchdb-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 5B9EB70E3 for ; Sun, 25 Dec 2011 08:43:24 +0000 (UTC) Received: (qmail 75597 invoked by uid 500); 25 Dec 2011 08:43:23 -0000 Delivered-To: apmail-couchdb-dev-archive@couchdb.apache.org Received: (qmail 75551 invoked by uid 500); 25 Dec 2011 08:43:23 -0000 Mailing-List: contact dev-help@couchdb.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@couchdb.apache.org Delivered-To: mailing list dev@couchdb.apache.org Received: (qmail 75543 invoked by uid 99); 25 Dec 2011 08:43:22 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 25 Dec 2011 08:43:22 +0000 X-ASF-Spam-Status: No, hits=-0.0 required=5.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of paul.joseph.davis@gmail.com designates 209.85.220.180 as permitted sender) Received: from [209.85.220.180] (HELO mail-vx0-f180.google.com) (209.85.220.180) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 25 Dec 2011 08:43:17 +0000 Received: by vcbfo14 with SMTP id fo14so10850364vcb.11 for ; Sun, 25 Dec 2011 00:42:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=vqRtYtsss1hV2eUOJeSfFJpqSmeHV/FBteiss/It2ug=; b=IAbXMUgCJQqBxo7cK2NSfbx79VyZ8PEqYs13aK6V/Z0F2tf67T3juE/JueiTqZAUPV 2LcJ3pazCKgC9uqf9/Cxn36zUsKOLk/WMUzJs7N36HiBLSrUnlljh57yKqjVMPzBCsHm 9CvSDRBDxMoPXPnNl2vv8lhIRXYaJH9tp/qqg= Received: by 10.220.155.3 with SMTP id q3mr13635572vcw.36.1324802576170; Sun, 25 Dec 2011 00:42:56 -0800 (PST) MIME-Version: 1.0 Received: by 10.220.190.202 with HTTP; Sun, 25 Dec 2011 00:42:15 -0800 (PST) In-Reply-To: References: <4BB9B050-6977-446E-8844-C6FE83DBDDA4@dionne-associates.com> <73943F19-3FBE-43D7-A97B-E012A391BFED@apache.org> From: Paul Davis Date: Sun, 25 Dec 2011 02:42:15 -0600 Message-ID: Subject: Re: Understanding the CouchDB file format To: dev@couchdb.apache.org Content-Type: text/plain; charset=ISO-8859-1 On Thu, Dec 22, 2011 at 8:53 PM, Riyad Kalla wrote: > Randall, > > Spot on; we are on the same page now. I'll go through the post again > tomorrow morning and reword it a bit so the thoughts aren't so fragmented. > > Best, > Riyad > > P.S.> Are there discussions anywhere on alternative file formats for the > indices that the Couch community has considered in the past? (old JIRA or > mailing list discussions) or has this file format been the only approach > from the initial work started in 05/06? > > The reason I ask is because of the index fragmentation you mentioned that > can occur over time... I'd be curious if an in-place-edited format for the > indices as separate file(s) from the append-only document revisions would > yield better performance over time. The splits would still be expensive, > but you'd be able to pre-allocate your blocks for each node on each split > to help a bit. Even leveraging an append-only, pre-allocated approach to > the index might work where split nodes are appended to the end of the index > file with the full node size (e.g. 133 elements) pre-allocated and the > index marked ready for compaction or something. > > So it would be a hybrid approach... when splits aren't needed, you do an > in-place edit to the index on-disk. If a split is needed, you use a > technique similar to the one now where the effected node hierarchy is > appended to the end of the file. > > You still run the risk of costly index rebuilds in the cast of a failed > in-place edit, but you wouldn't run the risk of any data loss... I am just > wondering for large data sets if this would yield some significant benefits > putting Couch somewhere between a Mongo and Couch (performance wise). > > Just pie-in-the-sky thinking... I am sure the senior team here has talked > this stuff to death years ago. My apologies if this is re-treading road > covered in beaten horse bodies ;) > > I just find this category of problems interesting. > The goal of the view engine refactoring was to encourage people to start asking and answering exactly these types of questions. I got a bit distracted by work stuff since but I've still got a plan and some notes on a blog post about how to write new indexers. Hopefully in time we'll have a wider array of secondary index types than just the m/r style we have currently. > > On Thu, Dec 22, 2011 at 6:10 PM, Randall Leeds wrote: > >> On Thu, Dec 22, 2011 at 12:11, Riyad Kalla wrote: >> > On Thu, Dec 22, 2011 at 12:38 PM, Robert Newson >> wrote: >> > >> >> There are a >> >> few parts of the article that are inaccurate (like the assertion we >> >> have good locality for the id and seq trees. If this were true we >> >> wouldn't have seen such a huge improvement in compaction by >> >> temporarily separating them). >> > >> > >> > I'd look forward to more detail on this... it was my understanding the >> > updates were appended out in the [doc rev][_id idx diff][_seq idx diff] >> > format at the end of the data file. Sounds like I may have misunderstood >> > that? >> > >> >> Riyad, as you pointed out earlier, all the inner nodes are rewritten >> up to the root. The two btrees are not written in parallel, though, >> which means that for deep trees all the updated nodes are written >> before the other tree's nodes are written. Also remember that the >> trees themselves end up pretty fragmented since older nodes that >> haven't changed are back toward the beginning of the file. In general, >> I'm not sure there's much that's useful to mention about locality in >> the trees. Also, updating these trees requires reading the old values, >> so there is still seeking that occurs (if the pages aren't cached by >> the OS). >> >> > >> >> The 'COMPETE recreation' paragraph also >> >> strikes me as factually awry. >> >> >> > >> > I'd appreciate a detailed correction on this if it is wrong; all the >> > digging I've done (in this thread and other partial resources) suggests >> > that the path from the changed doc ref back to the root (including a copy >> > of all internal nodes and all of their child references) is written so as >> > being able to read-back into the index from the tail of the data file >> > quickly... specifically slides 17, 18 and 19 from this slidedeck ( >> > http://www.slideshare.net/jchrisa/btree-nosql-oak?from=embed) -- note >> that >> > the interim nodes [A..M] and [A..Z] are rewritten (including any and all >> > child pointers they contained). >> > >> > This is what I was referring to; I should either clean up my wording >> > (confusing) or I got it wrong in which case I'd appreciate any and all >> > corrections. >> >> Right. It mostly seems a bit confusing to me. >> "it DOES NOT just rewrite the nodes pathing from the leaf to the node >> and ONLY the connections for that single document" >> That doesn't sound quite right, but I can tell what you're trying to >> say is accurate. If I'm right, you mean that every changed inner node >> is rewritten in its entirety rather than having a single pointer to >> the new child updated in place. >> >> Cheers. Thanks for taking the time to write this up. >> >> -Randall >>