couchdb-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Couchdb Wiki] Update of "Partitioning proposal" by BenBrowning
Date Thu, 26 Feb 2009 23:47:13 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Couchdb Wiki" for change notification.

The following page has been changed by BenBrowning:
http://wiki.apache.org/couchdb/Partitioning_proposal

New page:
This page is for documenting and discussing the proposed database partitioning support in
CouchDB. Most of the features described here don't exist yet. If you'd like to help with implementation
please bring it up on the dev@ mailing list.

== High-Level Design Goals ==
 * Increase total write throughput by allowing a database to scale across multiple physical
machines
 * Increase total read throughput for multiple concurrent clients when fetching documents
by id
 * Make the partitioned database appear as a single database via the HTTP API, keeping compatibility
with existing client implementations
 * Allow the partition topology to be configurable to support different performance needs

== Initial Implementation Thoughts ==

=== Avoid JSON Overhead ===
Partition nodes should communicate via native Erlang terms instead of doing Erlang -> JSON
-> Erlang conversion. This implies an Erlang API for interacting with Couch which doesn't
officially exist yet.

=== Tree Partition Topology ===
Support a tree partition topology that can be as shallow or deep as the user needs. It should
be possible to have a flat tree with only one root and many leaf nodes, a binary tree structure,
or any variant in-between.
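To make the range of topologies concrete, here is a minimal Python sketch (CouchDB itself is Erlang; this is illustration only, and the node names are invented) showing a flat one-level tree and a binary tree represented as parent-to-children maps, with a helper that collects the leaf (storage) nodes:

```python
# Hypothetical topology sketches; node names ("root", "p1", "leaf1", ...)
# are invented for illustration and not part of the proposal.

# Flat tree: a single proxy root fanning out to many storage leaves.
flat = {"root": ["leaf1", "leaf2", "leaf3", "leaf4"]}

# Binary tree: each proxy node fans out to exactly two children.
binary = {
    "root": ["p1", "p2"],
    "p1": ["leaf1", "leaf2"],
    "p2": ["leaf3", "leaf4"],
}

def leaves(tree, node="root"):
    """Collect the storage (leaf) nodes reachable from `node`."""
    children = tree.get(node)
    if not children:          # no children recorded -> this is a leaf
        return [node]
    out = []
    for child in children:
        out.extend(leaves(tree, child))
    return out
```

Both shapes reach the same four storage nodes; only the depth of proxying differs.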

=== Consistent Hashing Algorithm ===
The mapping of IDs to nodes should use a [http://www.spiteful.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/
Consistent Hashing Algorithm]. What hasn't been fully decided (I don't think) is whether a proxy
node maps IDs only to its direct children or whether it knows how to map IDs directly to a
leaf node all the way down the tree. With this type of hashing algorithm, adding or removing
a storage node only requires moving data among its neighbors rather than across the entire system.
Node failover (which is out of the scope of this document) also becomes easier: you know
exactly which data needs to be replicated to which servers to maintain a redundant copy of
each node, and a failed node's load is spread among the remaining servers instead of falling
on just one.
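The neighbor-only data movement property can be demonstrated with a small Python sketch of a consistent-hash ring (a generic illustration of the linked technique, not CouchDB's implementation; the class and parameter names are invented). Each node is hashed onto the ring at several points, and a document ID is routed to the first node point at or after its own hash:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring sketch. Each node gets `replicas`
    points on the ring so document load spreads roughly evenly."""

    def __init__(self, nodes, replicas=64):
        self.replicas = replicas
        self.ring = []  # sorted list of (hash point, node)
        for node in nodes:
            self.add(node)

    def _point(self, key):
        # Map an arbitrary string to a point on the ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._point(f"{node}:{i}"), node))

    def remove(self, node):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def node_for(self, doc_id):
        # Route to the first node point clockwise from the ID's hash.
        points = [p for p, _ in self.ring]
        idx = bisect.bisect(points, self._point(doc_id)) % len(self.ring)
        return self.ring[idx][1]
```

Removing a node only remaps the document IDs that were assigned to it; every other ID keeps its existing assignment, which is exactly why rebalancing touches neighbors rather than the whole system.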

=== Proxy and Storage Nodes ===
Allow a node to be a proxy node, a storage node, or both. Storage nodes contain the data and
would typically be the leaf nodes of a tree. Proxy nodes combine results from multiple storage
nodes before passing them up the tree (or back to the client). The distinction is entirely
in configuration and only exists to simplify the mental model. If a node's ID hash points
all requests to other nodes, that node is a proxy node. If a node's ID hash points all requests
to itself, it is a storage node. If a node's ID hash points some requests to other nodes and
some requests to itself, it is both.
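The role distinction described above can be sketched as a small Python helper (illustrative only; the routing-table shape and names are assumptions, not part of the proposal): given a mapping from ID-hash buckets to the nodes that handle them, a node's role falls out of where its own ID appears.

```python
def node_role(node_id, routing):
    """Classify a node as the proposal describes. `routing` is a
    hypothetical table mapping each ID-hash bucket to the node that
    handles requests for that bucket."""
    targets = set(routing.values())
    if targets == {node_id}:
        return "storage"      # every request stays on this node
    if node_id not in targets:
        return "proxy"        # every request is forwarded elsewhere
    return "both"             # some requests stay, some are forwarded
```

For example, a node whose table routes every bucket to other nodes is purely a proxy, while one that routes every bucket to itself is purely storage.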
