couchdb-dev mailing list archives

From Damien Katz <>
Subject Re: Partitioned Clusters
Date Fri, 20 Feb 2009 22:45:52 GMT

On Feb 20, 2009, at 4:37 PM, Stefan Karpinski wrote:

>> Trees would be overkill except for very large clusters. With CouchDB
>> map views, you need to combine results from every node in a big merge
>> sort. If you combine all results at a single node, that single
>> client's ability to simultaneously pull data and sort data from all
>> the other nodes may become the bottleneck. So to parallelize, you have
>> multiple nodes doing a merge sort of sub-nodes, then sending those
>> results to another node to be combined further, etc. The same goes for
>> reduce views, but instead of a merge sort it's just rereducing
>> results. The natural "shape" of that computation is a tree, with only
>> the final root node at the top being the bottleneck, but now it has to
>> maintain connections and merge the sort values from far fewer nodes.
>> -Damien
> That makes sense and it clarifies one of my questions about this
> topic. Is the goal of partitioned clustering to increase performance
> for very large data sets, or to increase reliability? It would seem
> from this answer that the goal is to increase query performance by
> distributing the query processing, and not to increase reliability.
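The tree-shaped merge described in the quoted text can be sketched in a few lines. This is a minimal illustration, not CouchDB code: the node names and the per-node sorted rows are made up, and real view rows would be key/value pairs rather than bare integers.

```python
import heapq

# Hypothetical sorted view results from six partition nodes.
node_rows = {
    "n1": [1, 4, 9], "n2": [2, 5, 8], "n3": [3, 6, 7],
    "n4": [0, 10, 12], "n5": [11, 13], "n6": [14, 15],
}

def merge_sorted(streams):
    """Merge already-sorted result streams, as one tree node would."""
    return list(heapq.merge(*streams))

# Intermediate nodes each merge a subset of the leaves...
left = merge_sorted([node_rows["n1"], node_rows["n2"], node_rows["n3"]])
right = merge_sorted([node_rows["n4"], node_rows["n5"], node_rows["n6"]])

# ...and the root only has to merge the few pre-merged streams,
# so no single node pulls from every other node at once.
result = merge_sorted([left, right])
print(result)
```

For reduce views, the same tree shape applies, but each interior node would rereduce its children's reduce values instead of merge-sorting rows.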

I see partitioning and clustering as two different things. Partitioning  
is data partitioning: spreading the data out across nodes, with no node  
having the complete database. Clustering is nodes having the same, or  
nearly the same, data (they might be behind on replicating changes, but  
otherwise they have the same data).

Partitioning would primarily increase write performance (updates  
happening concurrently on many nodes) and the size of the data set.  
Partitioning helps with client read scalability, but only for document  
reads, not view queries. Partitioning alone could reduce reliability,  
depending on how tolerant you are of missing portions of the database.
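One common way to realize this kind of data partitioning is to hash each document id onto one of the nodes. This is a generic sketch, not something CouchDB did at the time; the node names are hypothetical.

```python
import hashlib

NODES = ["node0", "node1", "node2", "node3"]  # N = 4 partitions

def node_for(doc_id: str) -> str:
    """Route a document to one partition by hashing its id."""
    h = int(hashlib.sha1(doc_id.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# A single-document read or write touches exactly one node...
print(node_for("user:alice"))
# ...but a view query must still consult every node, since matching
# documents may live on any partition. That is why partitioning helps
# doc reads and writes but not view queries.
```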

Clustering would primarily address database reliability (failover) and  
client read scalability for docs and views. Clustering doesn't help  
much with write performance, because even if you spread out the update  
load, the replication as the cluster syncs up means every node gets  
the update anyway. It might be useful for dealing with update spikes,  
where you get a bunch of updates at once and can wait out the  
replication delay to get everyone synced back up.
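The arithmetic behind that point can be made explicit. The numbers below are invented for illustration: with M identical members, every update is eventually applied on all M nodes, so sustained write capacity stays roughly that of one node, while a short burst can land across all members and drain later through replication.

```python
per_node_writes = 1000   # hypothetical single-node write capacity (ops/s)
M = 4                    # cluster members holding the same data

# Every update is replicated to all M members, so sustained
# cluster-wide write capacity is still ~one node's worth...
sustained = per_node_writes

# ...but a burst can be absorbed at up to M times that rate and
# replicated out afterwards, once the spike has passed.
burst = M * per_node_writes

print(sustained, burst)  # 1000 4000
```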

For a really big, really reliable database, I'd have clusters of  
partitions, where the database is partitioned N ways, with each  
partition having at least M identical cluster members. Increase N for  
larger databases and update load, M for higher availability and read  
scalability.
