couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Anderson <>
Subject Partitioned Clusters
Date Fri, 20 Feb 2009 01:54:58 GMT
There's been renewed interest in multi-node clusters lately. CouchDB
has a clustering model in mind, but I don't know that we've dedicated
a thread to it on dev@.

This is definitely a feature we're ready to begin working on for
CouchDB. I'll say a little about the proposed model, and then some
about the implementation.

To be clear, I'm only talking about the model for managing databases
that need to be split across a relatively stable cluster of nodes.
This is more GFS than P2P.

The clustering model we've discussed informally is a tree of nodes. We
take advantage of the tree structure so that only immediate parents
need information about changes to their child nodes. If a child node
reaches the capacity to trigger a split, only it's immediate parent
needs to be informed.

Another advantage to constructing a tree of nodes is that we get O(log
N) proxy hops where N is number of leaf nodes. Even very large
clusters will result in a shallow tree of nodes. The leaf nodes of the
tree handle storage and the parent nodes handle proxying. Proxy nodes
can also take advantage of local disk for query caching.

In this model documents are saved to a leaf node depending on a hash
of the docid. This means that lookups are easy, and need only to touch
the leaf node which holds the doc. Redundancy can be provided by
maintaining R replicas of every leaf node.

View queries, on the other hand, must be handled by every node. The
requests are proxied down the tree to leaf nodes, which respond
normally. Each proxy node then runs a merge sort algorithm (which can
sort in constant space proportional to # of input streams) on the view
results. This can happen recursively if the tree is deep.

The final HTTP view response should look as though it came from a
single node of CouchDB.

I'm interested in implementing this where the requests and responses
are transferred as Erlang terms between nodes, and only converted to
JSON when they reach the root node of the cluster. This is not
something we should aim to do before 0.9, but now is not too early to
start laying the groundwork.

I also think the tree of nodes model is compelling partly because it
is easy to reason about (and similar to how we manage docs on disk).

What do you think?


Chris Anderson

View raw message