Mailing-List: contact dev-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@couchdb.apache.org
Received-SPF: pass (nike.apache.org: domain of bchesneau@gmail.com designates
 209.85.216.54 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CABvT1DHi8Ouj6S-kyf+EeWRNHUfm+RVnyDbkz_XAvV9OV-OjEw@mail.gmail.com>
References: 
 <CABvT1DHi8Ouj6S-kyf+EeWRNHUfm+RVnyDbkz_XAvV9OV-OjEw@mail.gmail.com>
Date: Thu, 9 May 2013 17:24:30 +0200
Message-ID: 
 <CAJNb-9qHk_JRL=xzmkk+YTq3W5jYkT_CbV29z43jWVkXMyJ4rA@mail.gmail.com>
Subject: Re: [VOTE] Merge BigCouch
From: Benoit Chesneau <bchesneau@gmail.com>
To: "dev@couchdb.apache.org" <dev@couchdb.apache.org>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

+1 for opening a branch with it. I think I have some questions but
will put them in a separate thread.

Thanks for all this code :)

- benoit

On Tue, May 7, 2013 at 10:34 PM, Robert Newson <rnewson@apache.org> wrote:
> Hi All,
>
> I propose to merge in the following work,
> https://github.com/rnewson/couchdb/tree/nebraska-merge-candidate to
> the official Apache CouchDB repository to a new branch (i.e., *not*
> master). Once there, the full CouchDB developer community can begin
> the work to incorporate the code here into an official release.
>
> You do not need to respond if you are in agreement. If there is no
> response in 72 hours, I will assume lazy consensus. If we reach
> consensus, I will start the IP clearance process and then the merge.
>
> As most of you know, Paul Davis and I recently sequestered ourselves
> away from society (in a place called Nebraska) to make this merge
> happen. I want to clarify that this work is not the BigCouch code you
> can see on github.com/cloudant/bigcouch but the Cloudant platform from
> which BigCouch was made. This means it is bang up to date with all the
> bug fixes and feature enhancements we've made in the last eighteen
> months or more. With that clarification made, here are our notes about
> what we achieved, what it means to the project and what isn't yet
> done;
>
> Nebraska Merge Roundup
>
>
> Stats:
>
>
> 1402 - total new commits
>
> 312 - commits written during the merge (will be reduced substantially
> by squashing)
>
> 408 - number of files changed
>
> 21,897 - number of lines added
>
> 4,277 - number of lines removed
>
> A retrospective:
>
> Bob Newson and I have come to the end of our merge sprint on getting
> BigCouch merged into Apache CouchDB. It's been a productive ten days
> here in the Midwest. I managed to get Bob out to a bowling alley and
> he managed to get me to a sushi restaurant. In between the cultural
> exchanges we've also managed to get a significant amount of work done
> on the merge.
>
>
> The current status of the merge is that we've managed to resolve the
> differences in the single node execution of CouchDB. Both the
> JavaScript and Erlang test suites run with only one failure in the
> Erlang test suite due to a (deliberately) missing constraint on the
> number of operating system processes. This should be a relatively
> straightforward fix but was not prioritized during our limited time to
> work on the larger issues.
>
>
> We merged a large number of performance and stability enhancements
> back into single node CouchDB as well as a number of pure bug fixes.
> The biggest highlight is a brand new compactor that is both faster and
> creates smaller and better organized post-compaction databases.
>
>
> The current status of the merge is that single node operations should
> be completely unaffected as demonstrated by the test suite passing. On
> the other hand we haven't yet finished getting the clustered code
> merged to use some of the new changes in single node CouchDB. The
> single most significant portion of this work involves updates to the
> internal cluster API for views to use the recently rewritten indexer
> APIs. This should be a relatively straightforward bit of work that
> we'll be finishing over the next few weeks.
>
>
> All in all the merge work done so far has been quite successful. We've
> met our primary goal of getting the code merged in a fashion that does
> not affect single node operation while providing a starting point for
> the larger community to start reviewing the more significant changes
> made. Given the size of the diff between the two code bases we never
> expected to have a fully working clustered solution after ten days of
> work but we have succeeded in providing a base of work that will allow
> us and new contributors to get up to speed quickly.
>
>
> This work, coupled with work by Dave Cottlehuber and Benoît Chesneau
> on updating the build system and various other internal updates, will
> provide a solid foundation for work going forward. It's an exciting
> time for CouchDB and anyone interested should keep an eye on the next
> few releases as we ramp up work on various core aspects of the
> database.
>
>
> We've had an exciting few days working to prepare the road for an
> exciting next twelve to eighteen months. We hope that everyone will
> feel as excited as we do about the next twelve to eighteen months for
> Apache CouchDB. It should be an exciting ride.
>
>
>
> Things we got done
>
>
> * Large update to the source tree layout for Erlang applications. Each
> application now has a src/appname/(c_src|ebin|priv|src) structure. The
> build system has been updated.
>
> * Renamed src/couchdb to src/couch to match the Erlang convention of
> the top directory name matching the Erlang application name.
>
> * Imported Cloudant Erlang applications for clustered CouchDB. These
> are imported with their history by using git subtree and merging the
> top level commit. These are not external deps, development will happen
> within the CouchDB tree. The imported apps are:
>
>
>    * config - A couch_config replacement (Behavior is mostly identical
> to couch_config except how we listen for configuration changes
> internally to allow for smooth hot code upgrade).
>
>    * twig - An rsyslog source replacement for couch_log.
>
>    * rexi - An RPC library. Replaces Erlang's built-in rex application
> to avoid costly safety measures in the interest of performance and
> throughput.
>
>    * mem3 - The "Dynamo" part of BigCouch responsible for managing
> cluster state
>
>    * fabric - The internal cluster-aware CouchDB API
>
>    * ets_lru - A small library application that provides an LRU
> implementation using a couple of ets tables.
>
>    * ddoc_cache - Caches design documents on each node for use in
> design handler functions. This uses an ets_lru cache with a very short
> TTL.
>
>    * chttpd - The cluster aware HTTP layer
>
>
> Each imported app also had its build system updated to use Autotools
> along with the necessary updates noted above for the new application
> layouts for existing CouchDB Erlang apps.
>
>
> * Merged a large number of updates and fixes to couch_replicator based
> on work done internally at Cloudant. Unfortunately due to an error
> when we created our internal clone we lost a bit of history in some of
> the initial merge and have a big commit that affects
> couch_replicator_manager mostly. There are a number of other commits
> related to couch_replicator that resolve the single node vs. clustered
> differences. Some noticeable couch_replicator features:
>
>
>    * Optionally disable checkpoints so that replication can work when
> a source is read only. This should only be used for smaller databases
> as each replication call has to scan the entire source database on
> each invocation.
>
>    * A new changes_pending field in the _active_tasks output
>
>    * A fix to the continuous replication to automatically reconnect to
> a continuous changes feed when it sees a last_seq value. This allows
> for the source to selectively recycle the HTTP connections used which
> can be quite useful for "permanent" replications.
>
>    * A multitude of smaller bug fixes and stability enhancements.
>
>
> Updates to single node couch:
>
>
>  * We changed the by_seq tree to store a copy of the #full_doc_info{}
> record instead of the #doc_info{} record. This gives significant speed
> improvements for compaction and replication and generally anything
> that needs to walk the by_seq tree and access document bodies
> internally.
>
>  * We rewrote the compactor to be significantly faster and to produce
> significantly better compacted databases. The two main halves are to
> use a temp file and to replace the use of btrees in that temp file.
> The temp file only contains a temporary copy of the document ids. At
> the end of a compaction run we then rebuild the by_id btree in the
> compaction file from this temp file. The reason this helps so much is
> that the compaction is based on the update_seq btree, which for most
> cases means that the id tree is updated in roughly random order which
> is very bad for our append only btrees. By using the tmp file we can
> stream it in order back into the compacted db file at the end of
> compacting, generating a minimum amount of garbage in the process. The
> other upgrade was to implement an external merge sort module
> (couch_emsort) that is used with this temporary file.
>
>  * Reject updates to design docs that break compilation of their
> source code. Currently we only check map and reduce functions, as the
> others should provide user visible errors instead of inexplicably
> empty views.
>
> because my OCD kicked in and I was unable to resist.
>
>  * Reverted a change made a long time ago that uses two file
> descriptors for each database. See the todo list.
>
>  * The reason to remove the second fd is so that we can rewrite ref
> counting. Better ref counting makes everyone happy, but the real
> reason is for this next bullet point:
>
>  * Optimize couch_server to not require a round trip message pass for
> opening a database that's in the LRU. This is a significant
> performance boost for high concurrency access. We also optimized
> couch_server internals to not blow up when it's under load.
>
>  * Introduce a #leaf{} record into the revision trees. This is never
> written to disk but makes internal code a lot cleaner when dealing
> with multiple versions of rev tree values.
>
>  * Some changes to couch_changes to enable clustered access. Also some
> general cleanup.
>
>  * Internal changes to how CouchDB is booted in Erlang land. Not very
> sexy but this removes a lot of complicated un-Erlangy bits. We still
> have a bit of work left here.
>
>  * btree chunk sizes are now configurable which can allow people to
> adjust the RAM/speed tradeoffs a bit more.
>
>  * We now load update validation functions on the first write. This is
> a cluster-motivated change because the clustered version of this call
> is expensive and can lead to race conditions when opening a bunch of
> db shards simultaneously. This should be invisible to external
> clients.
>
>  * Disabled conflict detection for local docs. They don't replicate so
> there's no point. This just led to clusters getting stuck and confused
> when there were lots of replications happening.
>
>  * Changes to the multipart/mime parsing code. Necessary for clustered
> attachment uploads to split the incoming data stream into N copies.
>
>  * Don't use init:restart/0 when reloading the ICU driver. I think
> this has a bug. But we should rewrite this driver to be a NIF anyway.
>
>  * New couch OS process manager. Significantly faster access to OS
> processes under heavy load. This replaces the hard limit with a soft
> limit. Processes spawned over the soft limit will be used until they've
> sat idle for a few minutes and then be closed. We have a todo item to
> add the hard ceiling back in (while keeping the soft ceiling).
>
>  * Automatically replace some easily identifiable JS reductions with
> their builtin counterparts. Uses a regex to do the detection so it's
> not too smart.
>
>  * Improved view updater write batch.
>
>  * Updates to couchjs' views.js to improve index update speeds
>
>  * Updates to the _stats builtin reduce to allow reduces to work over
> emitted stats objects. Sometimes clients have summary data in a doc,
> and this allows them to combine stats if they follow the same pattern
> as the builtin expects.
>
>  * Added a config:reload() that is accessible by POST'ing to
> _config/_reload. Used by the JS tests to reset the config to what's on
> disk. This should prevent those test run failures where a test fails
> leaving the config in a bad state causing all subsequent tests to
> fail. I think. Maybe.
>
>  * Databases are deleted synchronously in the test suite. We may need
> to address this on Windows. But it does seem to reduce the number of
> "{error, file_exists}" failures.
>
>  * I reimplemented the JS restartServer() function. There's a new
> _restart/token URL that will give a unique value for each instance of
> the Erlang VM. To run a restart we grab the current token value, hit
> _restart, then wait till we get a successful response with a different
> token. This appears to have made the restart strategy more robust.
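
The restart strategy in the last bullet can be sketched roughly as below. This is only an illustration in Python, not the actual JS test helper: fetch_token and trigger_restart are hypothetical callables standing in for HTTP requests to _restart/token and _restart, and the timeout and interval values are assumptions.

```python
import time


def restart_and_wait(fetch_token, trigger_restart, timeout=30.0, interval=0.1,
                     sleep=time.sleep, clock=time.monotonic):
    """Restart the server and block until a new VM instance answers.

    fetch_token and trigger_restart are hypothetical stand-ins for the
    HTTP calls; fetch_token may raise OSError while the server is down.
    """
    old_token = fetch_token()          # token of the VM running right now
    trigger_restart()                  # ask the server to restart
    deadline = clock() + timeout
    while clock() < deadline:
        try:
            token = fetch_token()
        except OSError:
            token = None               # server is not answering yet
        if token is not None and token != old_token:
            return token               # a different VM instance is up
        sleep(interval)
    raise TimeoutError("server did not come back with a new token")
```

Comparing tokens rather than just waiting for any successful response avoids mistaking the old, not yet stopped VM for the restarted one.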
>
>
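
For readers unfamiliar with the idea behind the compactor rewrite in the list above (collect doc ids in whatever order the update_seq walk produces them, then stream them back in sorted order), here is a minimal external merge sort sketch in Python. It is not couch_emsort and makes no claim about how that module works internally; the function name, run_size parameter and temp-file layout are all assumptions, and ids are assumed to contain no newlines.

```python
import heapq
import tempfile


def sorted_ids(ids_in_seq_order, run_size=1000):
    """Yield doc ids in sorted order while holding at most run_size in memory.

    Ids arrive in update-sequence order (effectively random by id). They
    are buffered into sorted runs, each run is spilled to a temp file, and
    the runs are then merged lazily.
    """
    runs = []
    buf = []

    def spill():
        # Write the current buffer as one sorted run, one id per line.
        run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        run.writelines(doc_id + "\n" for doc_id in sorted(buf))
        run.seek(0)
        runs.append(run)
        buf.clear()

    def read_run(run):
        for line in run:
            yield line.rstrip("\n")

    for doc_id in ids_in_seq_order:
        buf.append(doc_id)
        if len(buf) >= run_size:
            spill()
    if buf:
        spill()

    try:
        # heapq.merge keeps only one pending id per run in memory.
        yield from heapq.merge(*(read_run(run) for run in runs))
    finally:
        for run in runs:
            run.close()
```

For example, list(sorted_ids(["doc9", "doc1", "doc5"], run_size=2)) spills two runs and yields the ids in sorted order, which is the order an append-only by_id btree can be built in with minimal garbage.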
>
> Things that need doing
>
>
> IP Clearance -
>
>
> We'll need to track down if we have the CCLA as well as look at each
> source file added to make sure each one is strictly from Cloudant or
> has an amenable license. I'm pretty sure that the only one of interest
> is trunc_io.erl but we need to be thorough.
>
> documentation -
>
>
> There shouldn't be much here since the entire point of this merge was
> to not change the visible behavior of single node couch. A few things
> to add about the testing endpoints. Maybe an update to the compaction
> section mentioning the two new file names used.
>
>
> Copyright notices -
>
>
> We need to strip out copyright notices from individual files and make
> sure all files have a standard Apache License v2 header.
>
>
> clustered vhosts -
>
>
> We've never implemented this at Cloudant. We either need to write a
> clustered version or go back and tell people to use HAProxy (or similar) for
> such things.
>
>
> twig -
>
>
> We need to add another output type to twig that is configurable in
> some manner. Right now we spit out entire rsyslog records which isn't
> useful for most people. We'll need to implement the file writer from
> couch_log as well as update the _log HTTP handler to know when it can
> and can't expect to find data on disk.
>
>
> fabric -
>
>
> This is going to need a lot of work. Specifically view access is going
> to need to be updated to work with couch_mrview and friends.
>
>
> Boot a dev cluster -
>
>
> Once we fix up the clustering code we'll need to write instructions
> and scripts for pulling up a dev cluster.
>
>
> OTP stuff -
>
>
> We've updated each app but we still need to pull some parts out of
> couchdb into their own application. Specifically the HTTP layer needs
> its own app. We could probably pull out the os process/query_servers
> as well as the os daemons and friends. Once done we need to update the
> supervision trees so we don't have things like couch starting and
> managing the replication manager process.
>
>
> ddoc_cache -
>
>
> Wire this up in couch_httpd_db to actually be used. Right now it's only
> used in chttpd.
>
>
> couch_file upgrade -
>
>
> The revert to remove the second updater_fd from each #db{} record
> means that we're back in the original position of files appearing to
> slow down significantly under load. Since the initial hammer approach
> of just adding a second fd, we've discovered that the underlying bug
> is due to the way that message passing works combined with Erlang's
> file io. More significant, though, is the fact that the fix is rather
> simple to implement. A first draft of this work is on an old
> branch of mine here:
>
>
>    https://github.com/davisp/couchdb/commit/d856878
>
>
> finish the size calculating changes -
>
>
> The #leaf{} record change is to enable us to add more data size
> calculations. CouchDB master calculates a data size that accounts for
> all bytes that are active in a .couch file. Cloudant is interested in
> the total size of uncompressed docs and attachments minus the internal
> overhead of btrees. And there's a fourth number to calculate based on
> the compression level used. Having each of these numbers will be
> useful as well as the calculations they'll enable (i.e., dead bytes in
> file, bytes used for overhead, compression ratio achieved, etc).
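
To make the derived numbers concrete, here is one possible reading of them as a small Python sketch. The four field names and the three formulas are my assumptions about what the paragraph above implies; the message does not define them exactly and nothing here reflects actual CouchDB code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DbSizes:
    # All four fields are hypothetical names for the numbers discussed above.
    file: int        # total bytes of the .couch file on disk
    active: int      # bytes still live (referenced) within the file
    external: int    # uncompressed size of docs and attachments
    compressed: int  # size of the same docs and attachments as stored

    @property
    def dead_bytes(self):
        # Bytes in the file that are no longer referenced.
        return self.file - self.active

    @property
    def overhead_bytes(self):
        # Live bytes that are not document or attachment data (e.g. btrees).
        return self.active - self.compressed

    @property
    def compression_ratio(self):
        # None when nothing is stored, to avoid dividing by zero.
        return self.external / self.compressed if self.compressed else None
```

With DbSizes(file=1000, active=600, external=900, compressed=450) this gives 400 dead bytes, 150 bytes of overhead and a compression ratio of 2.0.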
>
>
> couch_proc_manager -
>
>
> We need to implement the hard ceiling for capping the number of OS
> processes. We've started seeing a need for this at Cloudant with some
> workloads so motivation to fix this is high. The only failing etap is
> the assertion of this ceiling.
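
As a rough illustration of how a soft limit and a hard ceiling could coexist, here is a toy Python model. It is not couch_proc_manager; the class, its method names and the explicit now argument (used instead of a real clock to keep the example deterministic) are all assumptions.

```python
class ProcPool:
    """Toy model of an OS process pool with a soft limit and a hard ceiling."""

    def __init__(self, soft_limit, hard_limit, idle_timeout):
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.idle_timeout = idle_timeout
        self.idle = {}      # proc id -> time it became idle
        self.busy = set()   # proc ids currently handed out
        self._next_id = 0

    def total(self):
        return len(self.idle) + len(self.busy)

    def acquire(self):
        if self.idle:
            proc, _ = self.idle.popitem()   # reuse an idle process
        elif self.total() < self.hard_limit:
            proc = self._next_id            # "spawn" a new process
            self._next_id += 1
        else:
            raise RuntimeError("hard ceiling reached")
        self.busy.add(proc)
        return proc

    def release(self, proc, now):
        self.busy.remove(proc)
        self.idle[proc] = now

    def reap(self, now):
        """Close long-idle processes, but only while above the soft limit."""
        closed = []
        for proc, since in sorted(self.idle.items(), key=lambda kv: kv[1]):
            if self.total() <= self.soft_limit:
                break
            if now - since >= self.idle_timeout:
                del self.idle[proc]
                closed.append(proc)
        return closed
```

Spawning past the soft limit stays cheap under bursts, while the hard ceiling bounds the worst case; reap() is what shrinks the pool back once the burst is over.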
>
>
> Synchronous db delete on Windows -
>
>
> I did this because running the test suite was driving me bonkers. I
> need to ask Dave about how this behaves on Windows (my guess is not
> well) but I think we can close things up so that it works better than
> the status quo.