Date: Thu, 9 May 2013 17:24:30 +0200
Subject: Re: [VOTE] Merge BigCouch
From: Benoit Chesneau
To: "dev@couchdb.apache.org"

+1 for opening a branch with it. I think I have some questions but will put
them in a separate thread. Thanks for all this code :)

- benoit

On Tue, May 7, 2013 at 10:34 PM, Robert Newson wrote:
> Hi All,
>
> I propose to merge in the following work,
> https://github.com/rnewson/couchdb/tree/nebraska-merge-candidate to
> the official Apache CouchDB repository to a new branch (i.e., *not*
> master). Once there, the full CouchDB developer community can begin
> the work to incorporate the code here into an official release.
>
> You do not need to respond if you are in agreement. If there is no
> response in 72 hours, I will assume lazy consensus. If we reach
> consensus, I will start the IP clearance process and then the merge.
>
> As most of you know, Paul Davis and I recently sequestered ourselves
> away from society (in a place called Nebraska) to make this merge
> happen. I want to clarify that this work is not the BigCouch code you
> can see on github.com/cloudant/bigcouch but the Cloudant platform from
> which BigCouch was made. This means it is bang up to date with all the
> bug fixes and feature enhancements we've made in the last eighteen
> months or more.
> With that clarification made, here are our notes about
> what we achieved, what it means to the project and what isn't yet
> done;
>
> Nebraska Merge Roundup
>
> Stats:
>
> 1402 - total new commits
> 312 - commits written during the merge (will be reduced substantially
> by squashing)
> 408 - number of files changed
> 21,897 - number of lines added
> 4,277 - number of lines removed
>
> A retrospective:
>
> Bob Newson and I have come to the end of our merge sprint on getting
> BigCouch merged into Apache CouchDB. It's been a productive ten days
> here in the midwest. I managed to get Bob out to a bowling alley and
> he managed to get me to a sushi restaurant. In between the cultural
> exchanges we've also managed to get a significant amount of work done
> on the merging.
>
> The current status of the merge is that we've managed to resolve the
> differences in the single node execution of CouchDB. Both the
> JavaScript and Erlang test suites run with only one failure in the
> Erlang test suite due to a (deliberately) missing constraint on the
> number of operating system processes. This should be a relatively
> straightforward fix but was not prioritized during our limited time to
> work on the larger issues.
>
> We merged a large number of performance and stability enhancements
> back into single node CouchDB as well as a number of pure bug fixes.
> The biggest highlight is a brand new compactor that is both faster and
> creates smaller and better organized post-compaction databases.
>
> The current status of the merge is that single node operations should
> be completely unaffected as demonstrated by the test suite passing. On
> the other hand we haven't yet finished getting the clustered code
> merged to use some of the new changes in single node CouchDB. The
> single most significant portion of this work involves updates to the
> internal cluster API for views to use the recently rewritten indexer
> APIs.
> This should be a relatively straightforward bit of work that
> we'll be finishing over the next few weeks.
>
> All in all the merge work done so far has been quite successful. We've
> met our primary goal of getting the code merged in a fashion that does
> not affect single node operation while providing a starting point for
> the larger community to start reviewing the more significant changes
> made. Given the size of the diff between the two code bases we never
> expected to have a fully working clustered solution after ten days of
> work but we have succeeded in providing a base of work that will allow
> us and new contributors to get up to speed quickly.
>
> This work, coupled with work by Dave Cottlehuber and Benoît Chesneau
> on updating the build system and various other internal updates, will
> provide a solid foundation for work going forward. It's an exciting
> time for CouchDB and anyone interested should keep an eye on the next
> few releases as we ramp up work on various core aspects of the
> database.
>
> We've had an exciting few days working to prepare the road for an
> exciting next twelve to eighteen months. We hope that everyone will
> feel as excited as we do about the next twelve to eighteen months for
> Apache CouchDB. It should be an exciting ride.
>
> Things we got done
>
> * Large update to the source tree layout for Erlang applications. Each
> application now has a src/appname/(c_src|ebin|priv|src) structure. The
> build system has been updated.
>
> * Renamed src/couchdb to src/couch to match the Erlang convention of
> the top directory name matching the Erlang application name.
>
> * Imported Cloudant Erlang applications for clustered CouchDB. These
> are imported with their history by using git subtree and merging the
> top level commit. These are not external deps; development will happen
> within the CouchDB tree.
> The imported apps are:
>
>   * config - A couch_config replacement (Behavior is mostly identical
>   to couch_config except how we listen for configuration changes
>   internally to allow for smooth hot code upgrade).
>
>   * twig - An rsyslog source replacement for couch_log.
>
>   * rexi - An RPC library. Replaces Erlang's built-in rex application
>   to avoid costly safety measures in the interest of performance and
>   throughput.
>
>   * mem3 - The "Dynamo" part of BigCouch responsible for managing
>   cluster state
>
>   * fabric - The internal cluster-aware CouchDB API
>
>   * ets_lru - A small library application that provides an LRU
>   implementation using a couple of ets tables.
>
>   * ddoc_cache - Caches design documents on each node for use in
>   design handler functions. This uses an ets_lru cache with a very short
>   TTL.
>
>   * chttpd - The cluster aware HTTP layer
>
> Each imported app also had its build system updated to use Autotools
> along with the necessary updates noted above for the new application
> layouts for existing CouchDB Erlang apps.
>
> * Merged a large amount of updates and fixes to couch_replicator based
> on work done internally at Cloudant. Unfortunately due to an error
> when we created our internal clone we lost a bit of history in some of
> the initial merge and have a big commit that affects
> couch_replicator_manager mostly. There are a number of other commits
> related to couch_replicator that resolve the single node vs. clustered
> differences. Some noticeable couch_replicator features:
>
>   * Optionally disable checkpoints so that replication can work when
>   a source is read only. This should only be used for smaller databases
>   as each replication call has to scan the entire source database on
>   each invocation.
>
>   * A new changes_pending field in the _active_tasks output
>
>   * A fix to the continuous replication to automatically reconnect to
>   a continuous changes feed when it sees a last_seq value.
>   This allows
>   for the source to selectively recycle the HTTP connections used which
>   can be quite useful for "permanent" replications.
>
>   * A multitude of smaller bug fixes and stability enhancements.
>
> Updates to single node couch:
>
> * We changed the by_seq tree to store a copy of the #full_doc_info{}
> record instead of the #doc_info{} record. This gives significant speed
> improvements for compaction and replication and generally anything
> that needs to walk the by_seq tree and access document bodies
> internally.
>
> * We rewrote the compactor to be significantly faster and to produce
> significantly better compacted databases. The two main halves
> are to use a temp file and replace the use of btrees in the temp file.
> The temp file only contains a temporary copy of the document ids. At
> the end of a compaction run we then rebuild the by_id btree in the
> compaction file from this temp file. The reason this helps so much is
> that the compaction is based on the update_seq btree, which for most
> cases means that the id tree is updated in roughly random order, which
> is very bad for our append only btrees. By using the tmp file we can
> stream it in order back into the compacted db file at the end of
> compacting, generating a minimum amount of garbage in the process. The
> other upgrade was to implement an external merge sort module
> (couch_emsort) that is used with this temporary file.
>
> * Reject updates to design docs that break compilation of their
> source code. Currently we only check map and reduce
> calls as the others should provide user visible errors instead of
> inexplicably empty views.
>
> because my OCD kicked in and I was unable to resist.
>
> * Reverted a change made a long time ago that uses two file
> descriptors for each database. See the todo list.
>
> * The reason to remove the second fd is so that we can rewrite ref
> counting.
> Better ref counting makes everyone happy, but the real
> reason is for this next bullet point:
>
> * Optimize couch_server to not require a round trip message pass for
> opening a database that's in the LRU. This is a significant
> performance boost for high concurrency access. We also optimized
> couch_server internals to not blow up when it's under load.
>
> * Introduce a #leaf{} record into the revision trees. This is never
> written to disk but makes internal code a lot cleaner when dealing
> with multiple versions of rev tree values.
>
> * Some changes to couch_changes to enable clustered access. Also some
> general cleanup.
>
> * Internal changes to how CouchDB is booted in Erlang land. Not very
> sexy but this removes a lot of complicated un-Erlangy bits. We still
> have a bit of work left here.
>
> * btree chunk sizes are now configurable which can allow people to
> adjust the RAM/speed tradeoffs a bit more.
>
> * We now load update validation functions on the first write. This is
> a cluster-motivated change because the clustered version of this call
> is expensive and can lead to race conditions when opening a bunch of
> db shards simultaneously. This should be invisible to external
> clients.
>
> * Disabled conflict detection for local docs. They don't replicate so
> there's no point. This just led to clusters getting stuck and confused
> when there were lots of replications happening.
>
> * Changes to the multipart/mime parsing code. Necessary for clustered
> attachment uploads to split the incoming data stream into N copies.
>
> * Don't use init:restart/0 when reloading the ICU driver. I think
> this has a bug. But we should rewrite this driver to be a NIF anyway.
>
> * New couch OS process manager. Significantly faster access to OS
> processes under heavy load. This replaces the hard limit with a soft
> limit. Processes spawned over the soft limit will be used until they've
> sat idle for a few minutes and then be closed.
> We have a todo item to
> add the hard ceiling back in (while keeping the soft ceiling).
>
> * Automatically replace some easily identifiable JS reductions with
> their builtin counterparts. Uses a regex to do the detection so it's
> not too smart.
>
> * Improved view updater write batch.
>
> * Updates to couchjs' views.js to improve index update speeds
>
> * Updates to the _stats builtin reduce to allow reduces to work over
> emitted stats objects. Sometimes clients have summary data in a doc,
> and this allows them to combine stats if they follow the same pattern
> as the builtin expects.
>
> * Added a config:reload() that is accessible by POST'ing to
> _config/_reload. Used by the JS tests to reset the config to what's on
> disk. This should prevent those test run failures where a test fails
> leaving the config in a bad state causing all subsequent tests to
> fail. I think. Maybe.
>
> * Databases are deleted synchronously in the test suite. We may need
> to address this on Windows. But it does seem to reduce the number of
> "{error, file_exists}" failures.
>
> * I reimplemented the JS restartServer() function. There's a new
> _restart/token URL that will give a unique value for each instance of
> the Erlang VM. To run a restart we grab the current token value, hit
> _restart, then wait till we get a successful response with a different
> token. This appears to have made the restart strategy more robust.
>
> Things that need doing
>
> IP Clearance -
>
> We'll need to track down if we have the CCLA as well as look at each
> source file added to make sure each one is strictly from Cloudant or
> has an amenable license. I'm pretty sure that the only one of interest
> is trunc_io.erl but we need to be thorough.
>
> documentation -
>
> There shouldn't be much here since the entire point of this merge was
> to not change the visible behavior of single node couch. A few things
> to add about the testing endpoints.
> Maybe an update to the compaction
> section to mention the two new file names used.
>
> Copyright notices -
>
> We need to strip out copyright notices from individual files and make
> sure all files have a standard Apache License v2 header.
>
> clustered vhosts -
>
> We've never implemented this at Cloudant. We either need to write a
> cluster or go back and tell people to use HAProxy (or similar) for
> such things.
>
> twig -
>
> We need to add another output type to twig that is configurable in
> some manner. Right now we spit out entire rsyslog records which isn't
> useful for most people. We'll need to implement the file writer from
> couch_log as well as update the _log HTTP handler to know when it can
> and can't expect to find data on disk.
>
> fabric -
>
> This is going to need a lot of work. Specifically view access is going
> to need to be updated to work with couch_mrview and friends.
>
> Boot a dev cluster -
>
> Once we fix up the clustering code we'll need to write instructions
> and scripts for pulling up a dev cluster.
>
> OTP stuff -
>
> We've updated each app but we still need to pull some parts out of
> couchdb into their own application. Specifically the HTTP layer needs
> its own app. We could probably pull out the os process/query_servers
> as well as the os daemons and friends. Once done we need to update the
> supervision trees so we don't have things like couch starting and
> managing the replication manager process.
>
> ddoc_cache -
>
> Wire this up in couch_httpd_db to actually be used. Right now it's only
> used in chttpd.
>
> couch_file upgrade -
>
> The revert to remove the second updater_fd from each #db{} record
> means that we're back in the original position of files appearing to
> slow down significantly under load.
> Since the initial hammer approach
> of just adding a second fd, we've discovered that the underlying
> bug is due to the way that message passing works combined with
> Erlang's file io. Significant, though, is the fact that the fix is
> rather simple to implement. A first draft of this work is on an old
> branch of mine here:
>
> https://github.com/davisp/couchdb/commit/d856878
>
> finish the size calculating changes -
>
> The #leaf{} record change is to enable us to add more data size
> calculations. CouchDB master calculates a data size that accounts for
> all bytes that are active in a .couch file. Cloudant is interested in
> the total size of uncompressed docs and attachments minus the internal
> overhead of btrees. And there's a fourth number to calculate based on
> the compression level used. Having each of these numbers will be
> useful as well as the calculations they'll enable (i.e., dead bytes in
> file, bytes used for overhead, compression ratio achieved, etc).
>
> couch_proc_manager -
>
> We need to implement the hard ceiling for capping the number of OS
> processes. We've started seeing a need for this at Cloudant with some
> work loads so motivation to fix this is high. The only failing etap is
> the assertion of this ceiling.
>
> Synchronous db delete on Windows -
>
> I did this because running the test suite was driving me bonkers. I
> need to ask Dave about how this behaves on Windows (my guess is not
> well) but I think we can close things up so that it works better than
> the status quo.
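
Illustrative sketch (not part of the original message): the compactor notes above describe collecting document ids in a temp file as they arrive in update_seq order, then streaming them back in sorted order with an external merge sort (couch_emsort). The Python below is a minimal sketch of that general technique only. It is not the couch_emsort implementation; the function names, the newline-delimited run files and the `run_size` parameter are all assumptions made for illustration, and ids are assumed to contain no newlines.

```python
import heapq
import os
import tempfile


def write_sorted_runs(ids, run_size, tmp_dir):
    """Buffer ids in arrival order and spill each full buffer to disk as a sorted run."""
    paths = []
    run = []

    def flush():
        if not run:
            return
        run.sort()
        fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run")
        with os.fdopen(fd, "w") as f:
            for doc_id in run:
                f.write(doc_id + "\n")
        paths.append(path)
        run.clear()

    for doc_id in ids:
        run.append(doc_id)
        if len(run) >= run_size:
            flush()
    flush()  # spill whatever is left in the buffer
    return paths


def read_run(path):
    """Lazily yield the ids stored in one sorted run file."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")


def sorted_ids(ids, run_size=1000):
    """Return ids in sorted order using on-disk runs plus a k-way merge."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = write_sorted_runs(ids, run_size, tmp_dir)
        # heapq.merge holds only one pending id per run in memory at a time.
        return list(heapq.merge(*(read_run(p) for p in paths)))
```

For example, `sorted_ids(["doc-9", "doc-2", "doc-7"], run_size=2)` writes two small runs and merges them. In a real compactor the merged stream would be consumed incrementally to build the id tree in order rather than collected into a list; the list here only keeps the sketch easy to check.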