couchdb-dev mailing list archives

From Benoit Chesneau <bchesn...@gmail.com>
Subject improve the bigcouch and rcouch merges
Date Tue, 21 Jan 2014 10:33:19 GMT
Hi all,

I was reviewing the bigcouch and rcouch branches this morning and I
think it is about time to start a real merge. Waiting for each merge to
reach a working version before starting any work on the final product
is quite unproductive, to say the least.

On the contrary, what I see in the current branches makes me think it is
a good time to start improving our code base toward the 3 main goals we
set a long time ago during the summit and confirmed in December last
year, i.e.:

- add a cluster facility to Apache CouchDB
- allow Apache CouchDB to be embedded in other Erlang applications
  (just like mnesia, the standard database library in Erlang)
- make Apache CouchDB more *OTP*ish

The changes in bigcouch are more monolithic than the rcouch changes,
since they adapt Apache CouchDB to run in a cluster environment. While
the cluster management part is quite isolated (mem3, rexi, chttpd),
other parts of bigcouch wrap or modify the Apache CouchDB internals:

- fabric adds fabric_rpc and fabric_rpc2, which wrap some modules and
  functions of the couchdb core so they can be called on the cluster.
  Note that the cluster part uses fabric for all calls to the couchdb
  nodes of the cluster and to return/merge the results.
- some changes have been made to optimise the core: ref counting was
  removed, ets_lru was added to handle caching, and db initialization
  now looks at the cluster to fetch ddocs and set the validate funcs.
- couch_replicator has been modified to work in a cluster.
- bigcouch also changed the way the configuration is handled in couchdb,
  making it more evented. This is better than the current version but
  it still relies on an INI file.
- logs are now handled by twig, a specific application created by
  Cloudant for that purpose.


The rcouch changes consist mostly of the following:

- making Apache CouchDB an OTP release. It also extracts some code and
  revisits the supervision, e.g. turning couch_httpd into a standalone
  application and making couch_replicator a full app.
- adding some features as standalone Erlang apps.
- adding view changes, which touch couch_mrview, couch_index and
  couch_replicator by adding new features without changing the current
  ones or the internals.
- changing the core by adding some optimizations and new features:
  removing ref counting, adding new caching, adding validate_doc_read.
  These optimizations have not been merged for now, with a view to
  using the ones in bigcouch. The collation has been replaced by a nif
  available as a standalone Erlang application, couch_collate.
- a lot of changes have been made to the build process.
- logs are handled by lager, an application created by Basho and
  commonly used these days in the Erlang world.

So rcouch and bigcouch currently conflict on the following:

- couch_replicator
- couch_index, couch_mrview
- internals: some parts of the code; the collation and jiffy changes
  are not the same, etc.
- build process
- logging system


The goal of making couchdb embeddable in another Erlang application
is not achieved in bigcouch and is still difficult with rcouch.

- Difficult with rcouch because the configuration is currently strongly
  based on settings passed via an ini file.
- Difficult with bigcouch because the internal changes imply the use
  of a cluster by default (like loading ddocs at initialization).
  It also has the same problem with the ini files.
- Both handle authorization at the core level instead of handling
  it in a different layer, which mixes code between the internal API
  and the HTTP API.

If we are able to embed the couchdb core in any Erlang application, it
will considerably help us to merge bigcouch and rcouch calmly and
easily, with people working in parallel to achieve it. Building a plugin
will be a lot easier as well. The good thing is that both projects have
the roots to do it; it is a matter of merging some code and refactoring
some internals:

- Have a clean internal API. Fabric, the cluster library of bigcouch,
  adds 2 modules wrapping the couchdb API (couch_db, couch_stream,
  ...): `fabric_rpc` and `fabric_rpc2`. The couch_httpd* modules also
  wrap it to send the results to the HTTP clients. Both add a layer
  over the internals just to be able to use the core, which is really
  inefficient. It forces Erlang applications that want to call couchdb
  directly (like couchc [1]) to also wrap these internals to make them
  useful. We should on the contrary offer a clean API at the core level,
  usable by fabric or any Erlang app. The API offered by `fabric_rpc*`
  or `couchc` should be our internal API.

So I propose, as a first merge action, to refactor the internal API so
it can be used directly by fabric and by future Erlang applications
embedding or directly calling couchdb.

- Separate the auth* part: remove all `#user_ctx{}` usage from the
  couch_* modules and handle it at the transport or application level
  (HTTP for now). By doing that we make the `couch` application
  completely independent of the transport, so any Erlang app can embed
  it and offer its own API to enter the data (just like mnesia). It will
  also remove all the extra code we have to force the authorization by
  faking the user context when it needs to use the internal API. The
  auth* should be provided as an extension imo.

- Remove the validate_* initialization from the core db level and leave
  it to the transport or the application. For example it could be done
  when opening the DB via the HTTP API by wrapping the open_db call.
  Some applications may not need it at all. The core should still
  provide hooks to do these actions, but the way the hooks are passed
  to the database should be an application decision. couch_httpd is an
  application exposing the couchdb API to others via HTTP.

- Add special database hooks only at the application level. Right
  now we have specific databases handled in the core code (`_users`
  and `_replicator`). We should instead pass these hooks at another
  level, for example when we initialize the couch_replicator
  application.

- Make the configuration created by an INI file optional. We could have
  a default configuration set via the API that can later be fed by
  the ini file or any other config system.

- Merge any optimizations at the same time (ref counting replacement).
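
To illustrate the optional-INI item in the list above, here is a minimal sketch in Python rather than Erlang, only to show the shape of the idea. The `Config` class, its method names and the default values are all hypothetical; nothing here is an existing couchdb, bigcouch or rcouch API. The configuration lives in memory with defaults, can be set through the API by the embedding application, and an INI source is just one optional way to feed it later.

```python
import configparser

# Hypothetical defaults; section names, keys and values are illustrative only.
DEFAULTS = {("couchdb", "max_dbs_open"): "100", ("log", "level"): "info"}


class Config:
    """In-memory settings that are usable without any INI file."""

    def __init__(self, defaults=None):
        self._values = dict(DEFAULTS if defaults is None else defaults)

    def get(self, section, key, default=None):
        return self._values.get((section, key), default)

    def set(self, section, key, value):
        self._values[(section, key)] = str(value)

    def load_ini_string(self, text):
        """Optional step: overlay values parsed from INI-formatted text."""
        parser = configparser.ConfigParser()
        parser.read_string(text)
        for section in parser.sections():
            for key, value in parser.items(section):
                self.set(section, key, value)


config = Config()                     # works with the defaults alone
config.set("log", "level", "debug")   # set via the API by the embedding app
config.load_ini_string("[couchdb]\nmax_dbs_open = 500\n")  # INI fed in later
```

Any other config system could feed the same `set` call, which is the point of not tying the core to a file format.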


Imo if we do the work above it will allow us to speed up the merge of
both rcouch and bigcouch. The second step would be having the fabric API
use this clean API, and modifying couch_httpd and couch_replicator to
use it. Bonus point: it would ease unit testing by isolating the cluster
part from the core.
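
The layering proposed above can be sketched roughly as follows, again in Python rather than Erlang and only as an illustration. Every name here (`CoreDb`, `open_db`, `HttpLayer`, `require_type`, `Forbidden`) is hypothetical and does not correspond to actual couch_* modules. The assumption is that the core knows nothing about users or transports, validation hooks are handed in by whoever opens the database, and authorization lives in the transport layer.

```python
class Forbidden(Exception):
    """Raised by a transport layer when a user may not perform an action."""


class CoreDb:
    """Core database: no user context, no HTTP, no cluster knowledge."""

    def __init__(self, name, validators=()):
        self.name = name
        self.validators = list(validators)  # hooks chosen by the caller
        self.docs = {}

    def put(self, doc_id, doc):
        for validate in self.validators:
            validate(doc)  # a hook raises an exception to reject the write
        self.docs[doc_id] = doc

    def get(self, doc_id):
        return self.docs[doc_id]


def open_db(name, validators=()):
    """Clean entry point usable by any caller: HTTP, cluster RPC, embedder."""
    return CoreDb(name, validators)


def require_type(doc):
    """Example validation hook supplied by an application, not by the core."""
    if "type" not in doc:
        raise ValueError("doc needs a 'type' field")


class HttpLayer:
    """Transport layer: does authorization and picks the hooks it wants."""

    def __init__(self, writers):
        self.writers = set(writers)
        self.db = open_db("demo", validators=[require_type])

    def handle_put(self, user, doc_id, doc):
        if user not in self.writers:
            raise Forbidden(user)
        self.db.put(doc_id, doc)
        return 201


# An embedding application calls the core directly, with no auth and no hooks.
embedded = open_db("embedded")
embedded.put("a", {"x": 1})

# The HTTP-like layer adds authorization and validation on top of the same core.
http = HttpLayer(writers={"alice"})
status = http.handle_put("alice", "b", {"type": "note"})
```

With this split, the core can be tested on its own, and a cluster layer would call `open_db` and `put` the same way the HTTP layer does.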

I probably forgot some features that could also be done during the
merge, like merging couch_index and couch_mrview into one app or at
least having `_all_docs` handled by couch_httpd, but these changes can
be done at another level imo. Also, `_all_dbs` is handled differently:
in bigcouch it is handled by recording all the dbs in a database, while
in couchdb we currently rely only on the fs. The bigcouch solution
should be the default imo.


Anyway, I hope we can start this work ASAP rather than just working on 2
branches in parallel.

Any feedback is welcome,

- benoit

[1] http://github.com/benoitc/couchc
