couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Anderson <>
Subject Re: Partitioned Clusters
Date Fri, 20 Feb 2009 22:55:25 GMT
On Fri, Feb 20, 2009 at 2:45 PM, Damien Katz <> wrote:
> On Feb 20, 2009, at 4:37 PM, Stefan Karpinski wrote:
>>> Trees would be overkill except for with very large clusters.
>>> With CouchDB map views, you need to combine results from every node in a
>>> big merge sort. If you combine all results at a single node, the single
>>> clients ability to simultaneously pull data and sort data from all other
>>> nodes may become the bottleneck. So to parallelize, you have multiple
>>> nodes
>>> doing a merge sort of sub nodes , then sending those results to another
>>> node
>>> to be combined further, etc.  The same with with the reduce views, but
>>> instead of a merge sort it's just rereducing results. The natural "shape"
>>> of
>>> that computation is a tree, with only the final root node at the top
>>> being
>>> the bottleneck, but now it has to maintain connections and merge the sort
>>> values from far fewer nodes.
>>> -Damien
>> That makes sense and it clarifies one of my questions about this topic. Is
>> the goal of partitioned clustering to increase performance for very large
>> data sets, or to increase reliability? It would seem from this answere
>> that
>> the goal is to increase query performance by distributing the query
>> processing, and not to increase reliability.
> I see partitioning and clustering as 2 different things. Partitioning is
> data partitioning, spreading the data out across nodes, no node having the
> complete database. Clustering is nodes having the same, or nearly the same
> data (they might be behind on replicating changes, but otherwise they have
> the same data).
> Partitioning would primarily increase write performance (updates happening
> concurrently on many nodes) and the size of the data set. Partitioning helps
> with client read scalability, but only for document reads, not views
> queries. Partitioning alone could reduce reliability, depending how tolerant
> you are to missing portions of the database.
> Clustering would primarily address database reliability (failover), address
> client read scalability for docs and views. Clustering doesn't help much
> with write performance because even if you spread out the update load, the
> replication as the cluster syncs up means every node gets the update anyway.
> It might be useful to deal with update spikes, where you get a bunch of
> updates at once and can wait for the replication delay to get everyone
> synced back up.
> For really big, really reliable database, I'd have clusters of partitions,
> where the database is partitioned N ways, each each partition have at least
> M identical cluster members. Increase N for larger databases and update
> load, M for higher availability and read load.

Thanks for the clarification.

Can you say anything about how you see rebalancing working?

Chris Anderson

View raw message