couchdb-dev mailing list archives

From Adam Kocoloski <>
Subject Re: CouchDB Cluster/Partition GSoC
Date Mon, 30 Mar 2009 02:12:18 GMT
Hi Randall, cool!  I can chime in on a couple of the questions ...

On Mar 29, 2009, at 8:59 PM, Randall Leeds wrote:

> 1) What's required to make CouchDB a full OTP application? Isn't it  
> using gen_server already?

Yes, in fact CouchDB is already an OTP application using supervisors,  
gen_servers, and gen_events.  There are situations in which it could  
do a better job of adhering to OTP principles, and it could probably  
also use some refactoring to make the partitioning code fit in easily.

> 2) What about _all_docs and seq-num?

I presume _all_docs gets merged like any other view.  _all_docs_by_seq  
is a different story.  In the current code the sequence number is  
incremented by one for every update.  If we want to preserve that  
behavior in partitioned databases we need some sort of consensus  
algorithm or master server to choose the next sequence number.  It  
could easily turn into a bottleneck or single point of failure if  
we're not careful.

The alternatives are to (a) abandon the current format for update  
sequences in favor of vector clocks or something more opaque, or (b)  
make _all_docs_by_seq strictly a node-local query.  I'd prefer the  
former, as I think it will be useful for e.g. external indexers to  
treat the partitioned database just like a single-server one.  If we  
do the latter, I think it means the external indexers either have to  
be installed on every node, or at least have to be aware of all the  
partitions.
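As a rough sketch of the vector-clock idea (Python, purely illustrative — none of these names come from CouchDB): each partition keeps its own local sequence counter, a checkpoint becomes a per-partition vector rather than a single integer, and an indexer merges checkpoints by taking the per-partition max.

```python
# Illustrative sketch only, not CouchDB code: an "opaque" update
# sequence for a partitioned database, modeled as a vector clock
# mapping partition name -> highest local sequence number seen.

def merge_seq(a, b):
    """Combine two checkpoints by taking the per-partition max."""
    partitions = set(a) | set(b)
    return {p: max(a.get(p, 0), b.get(p, 0)) for p in partitions}

def descends(a, b):
    """True if checkpoint a has seen at least everything in b."""
    return all(a.get(p, 0) >= n for p, n in b.items())

ckpt = merge_seq({"p0": 12, "p1": 7}, {"p1": 9, "p2": 3})
# ckpt == {"p0": 12, "p1": 9, "p2": 3}
```

An external indexer could then checkpoint against the whole cluster with one opaque value, without knowing how many partitions exist.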

One other thing that bothers me is the merge-sort required for every  
view lookup.  In *really* large clusters it won't be good if queries  
for a single key in a view have to hit each partition.  We could have  
an alternative structure where each view gets partitioned much like  
the document data while it's built.  I worry that a view partitioned  
in this way may need frequent rebalancing during the build, since view  
keys are probably not going to be uniformly distributed.   
Nevertheless, I think the benefit of having many view queries only hit  
a small subset of nodes in the cluster is pretty huge.
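For the record, the merge-sort itself is cheap at the coordinator — each partition returns rows already sorted by key, so the fan-out cost is the real problem.  A toy sketch (Python, illustrative only) of what a coordinating node would do:

```python
# Illustrative sketch: merging key-sorted view rows from several
# partitions into one sorted result stream, as a coordinator would
# after fanning a view query out to every partition.
import heapq

def merge_view_results(partition_rows):
    """partition_rows: list of key-sorted (key, value) row lists,
    one list per partition.  Lazily merges the sorted streams."""
    return list(heapq.merge(*partition_rows, key=lambda row: row[0]))

rows = merge_view_results([
    [("apple", 1), ("cherry", 3)],
    [("banana", 2), ("date", 4)],
])
# rows == [("apple", 1), ("banana", 2), ("cherry", 3), ("date", 4)]
```

The merge is linear in the number of rows returned, but every partition still pays the cost of a B-tree lookup even when it contributes nothing — which is the argument for partitioning the view by key instead.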

> 3) Can we agree on a proposed solution to the layout of partition  
> nodes? I like the tree solution, as long as it is extremely flexible  
> wrt tree depth.

I'm not sure we're ready to do that.  In fact, I think we may need to  
implement a couple of different topologies and profile them to see  
what works best.  The tree topology is an interesting idea, but it may  
turn out that passing view results up the tree is slower than just  
sending them directly to the final destination and having that server  
do the rest of the work.  Off the cuff I think trees may be a great  
choice for computationally intensive reduce functions, but other views  
where the size of the data is large relative to the computational  
requirements may be better off minimizing the number of copies of the  
data that need to be made.
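To make the trade-off concrete, here's a toy comparison (Python, illustrative only) of combining per-partition reduce values pairwise up a tree versus shipping them all to one coordinator.  For an associative combine function the answer is the same either way; the tree spreads the CPU cost over more nodes, while the flat merge minimizes the number of copies of the data.

```python
# Illustrative sketch: two ways to re-reduce per-partition partial
# reduce values.  "Flat" models sending every partial straight to one
# coordinator; "tree" models combining them pairwise level by level.
from functools import reduce

def rereduce_flat(combine, partials):
    """One coordinator folds all partial values itself."""
    return reduce(combine, partials)

def rereduce_tree(combine, partials):
    """Combine pairwise up a tree; each level halves the value count."""
    while len(partials) > 1:
        it = iter(partials)
        level = [combine(a, b) for a, b in zip(it, it)]
        if len(partials) % 2:          # odd one out rides up unchanged
            level.append(partials[-1])
        partials = level
    return partials[0]
```

With an expensive combine function the tree wins because the work parallelizes; with large values and a cheap combine, the extra hops per level are wasted copying.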

> 4) Should the consistent hashing algorithm map ids to leaf nodes or  
> just to
> children? I lean toward children because it encapsulates knowledge  
> about the
> layout of subtrees at each tree level.

If the algorithm maps to children, does that mean every document  
lookup has to traverse the tree?  I'm not sure that's a great idea.   
Up to ~100 nodes I think it may be better to have all document lookups  
take O(1) hops.  I think distributed Erlang can keep a system of that  
size globally connected without too much trouble.
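A sketch of what "O(1) hops" looks like (Python, illustrative only — node names and vnode count are made up): hash the doc id straight onto a consistent-hash ring of nodes, so any node can compute the owner locally instead of walking a tree.

```python
# Illustrative sketch: mapping a doc id directly to its owning node
# with a consistent-hash ring (virtual nodes smooth the distribution).
# Any node holding the ring can answer ownership in one local lookup.
import bisect
import hashlib

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        self._points = sorted(
            (_hash(f"{n}/{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._keys = [p for p, _ in self._points]

    def owner(self, doc_id):
        """First ring point at or after the doc's hash, wrapping around."""
        i = bisect.bisect(self._keys, _hash(doc_id)) % len(self._keys)
        return self._points[i][1]
```

Adding or removing a node only remaps the keys adjacent to its ring points, which is what makes this attractive for a fully connected cluster of up to ~100 nodes.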

Cheers, Adam
