couchdb-dev mailing list archives

From Randall Leeds <>
Subject CouchDB Cluster/Partition GSoC
Date Mon, 30 Mar 2009 00:59:33 GMT
To start, I'd like to introduce myself. I've been contributing off and on
in small ways to dev list activity and a little IRC chatter, but I'm not
very visible in the community.

My name is Randall and I'm a student at Brown University in Providence,
Rhode Island, USA. I've got one more semester ahead of me in my
undergraduate degree. I've been working with CouchDB on the Melkjug[1]
project since June and have been intermittently active with
couchdb-python[2] as a committer fixing small bugs.

I'd like to create and polish a proposal this week for submission as a
Google Summer of Code Project.

To that end, this thread is meant to start the drafting process and
determine a prioritized list of the tasks, and the dependencies between
them, required to support smooth clustering and partitioning in CouchDB.

Skip the next section if you just want to read my questions and jump right
into the discussion.

Otherwise, here's a brief overview of background information:

A clarification of terms:

On Fri, Feb 20, 2009 at 2:45 PM, Damien Katz <> wrote:
> I see partitioning and clustering as 2 different things. Partitioning is
> data partitioning, spreading the data out across nodes, no node having the
> complete database. Clustering is nodes having the same, or nearly the same
> data (they might be behind on replicating changes, but otherwise they have
> the same data).

Existing partitioning proposal[3] on the wiki:

From an e-mail between myself and Chris A on first steps:

Chris wrote:
>I think as far as writing goes, there's still more work to be done on
>design, but there are some pieces that can be written first:
> * consistent hashing Erlang proxy (start out with HTTP)
> * view merging across partitioned nodes
>These two can be run as their own software at first, so they can sit
>in front of a cluster of CouchDB machines without any changes
>happening to CouchDB. Once they work, they can be tied to CouchDB
>using Erlang terms and IPC instead of JSON/HTTP.
>There are some design questions about a partitioned CouchDB that we
>should probably take up on the list:
> * what about _all_docs and other node-global queries?
> * does a cluster use a single seq-num or does each node have its own?
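
To make the two pieces Chris lists concrete, here is a minimal sketch in
Python rather than Erlang. It is only an illustration, not a proposed
implementation: HashRing, node_for, merge_view_rows and the node names are
hypothetical, and the merge assumes each node returns rows already sorted
by keys that compare with plain Python ordering (real CouchDB view
collation is richer than that).

```python
import bisect
import hashlib
import heapq


def _hash(key: str) -> int:
    # Stable hash so every proxy instance agrees on placement.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with several points per node (hypothetical helper)."""

    def __init__(self, nodes, points_per_node=64):
        # Each node contributes many points so load spreads evenly.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, doc_id: str) -> str:
        # First ring point clockwise from the document's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, _hash(doc_id)) % len(self._ring)
        return self._ring[idx][1]


def merge_view_rows(per_node_rows):
    """Merge per-node view results, each already sorted by key, into one list."""
    return list(heapq.merge(*per_node_rows, key=lambda row: row["key"]))


# Usage: pick the backend for a document, then merge two nodes' view rows.
ring = HashRing(["couch-a", "couch-b", "couch-c"])  # assumed node names
target = ring.node_for("some-doc-id")
rows = merge_view_rows([
    [{"key": "apple", "value": 1}, {"key": "cherry", "value": 3}],
    [{"key": "banana", "value": 2}],
])
```

The property that matters for a proxy is that removing a node only moves
the documents that lived on that node; everything else stays put.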

Finally, the great folks from Meebo have recently posted couchdb-lounge[4]
which uses an nginx proxy to make a "partitioning/clustering framework for

The questions we should address are prioritized here:
1) What's required to make CouchDB a full OTP application? Isn't it using
gen_server already?
2) What about _all_docs and seq-num?
3) Can we agree on a proposed solution to the layout of partition nodes? I
like the tree solution, as long as it is extremely flexible wrt tree depth.
4) Should the consistent hashing algorithm map ids to leaf nodes or just to
children? I lean toward children because it encapsulates knowledge about the
layout of subtrees at each tree level.
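
To illustrate the "map to children" option in question 4, here is a small
Python sketch. It is an assumption-laden toy, not a design: the Node class
and the node names are hypothetical, and it uses rendezvous
(highest-random-weight) hashing as a compact stand-in for a consistent-hash
ring. The point is only that each node knows about its direct children and
nothing about the layout below them, at any tree depth.

```python
import hashlib


def _hash(key: str) -> int:
    # Stable hash so every router makes the same choice.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Node:
    """Node in a hypothetical partition tree: leaves store, inner nodes route."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def route(self, doc_id: str) -> str:
        # A leaf holds the document itself.
        if not self.children:
            return self.name
        # Choose among direct children only; the chosen child decides the rest.
        best = max(
            self.children,
            key=lambda child: _hash(f"{self.name}/{child.name}:{doc_id}"),
        )
        return best.route(doc_id)


# Usage: subtrees of different depths under one root.
tree = Node("root", [
    Node("east", [Node("e1"), Node("e2")]),
    Node("west", [Node("w1")]),
    Node("solo"),
])
leaf = tree.route("some-doc-id")
```

Because the root's choice depends only on its own children, rearranging
the subtree under one child does not move documents routed to the others.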

Submissions for GSoC are due by Friday, so I would appreciate any help in
polishing a proposal that will best serve the needs of the CouchDB
community. Hopefully this generates some initial discussion that will lead
me to a draft proposal in the next couple of days. I will post that draft
for revision and comment until I submit it at the end of the week.

Thanks in advance,

