couchdb-dev mailing list archives

From Randall Leeds <>
Subject Re: CouchDB Cluster/Partition GSoC
Date Wed, 01 Apr 2009 05:59:13 GMT
On Sun, Mar 29, 2009 at 22:12, Adam Kocoloski <> wrote:

> Hi Randall, cool!  I can chime in on a couple of the questions ...

Adam, thanks for your quick reply and thorough comments. The more people who
chime in on this discussion, the better I can make the proposal, both in
terms of its likelihood of acceptance by a mentor/Google and the value of the
resulting work to the community. I will aim to post a formalized draft of
the proposal on my GSoC registration page sometime tomorrow and open it up
for comments. Submission deadline is Friday.

> On Mar 29, 2009, at 8:59 PM, Randall Leeds wrote:
>> 1) What's required to make CouchDB a full OTP application? Isn't it using
>> gen_server already?
> Yes, in fact CouchDB is already an OTP application using supervisors,
> gen_servers, and gen_events.  There are situations in which it could do a
> better job of adhering to OTP principles, and it could probably also use
> some refactoring to make the partitioning code fit in easily.

Cool. We'll see what we need here as we clarify other issues.

>> 2) What about _all_docs and seq-num?
> I presume _all_docs gets merged like any other view.  _all_docs_by_seq is a
> different story.  In the current code the sequence number is incremented by
> one for every update.  If we want to preserve that behavior in partitioned
> databases we need some sort of consensus algorithm or master server to
> choose the next sequence number.  It could easily turn into a bottleneck or
> single point-of-failure if we're not careful.
> The alternatives are to a) abandon the current format for update sequences
> in favor of vector clocks or something more opaque, or b) have
> _all_docs_by_seq be strictly a node-local query.  I'd prefer the former, as
> I think it will be useful for e.g. external indexers to treat the
> partitioned database just like a single server one.  If we do the latter, I
> think it means that either the external indexers have to be installed on
> every node, or at least they have to be aware of all the partitions.

If at all possible I think we should have the entire partition group appear
as a single server from the outside. One thing that comes to mind here is a
question about sequence numbers. Vector clocks only guarantee a partial
ordering, but I'm under the impression that seq numbers currently have a
strict total ordering.
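To illustrate the partial-ordering point, here is a minimal vector clock
comparison sketch in Python (the representation and names are my own, not
anything in CouchDB): two clocks are ordered only when one dominates the
other componentwise, so updates made concurrently on different partitions
are simply incomparable, which is exactly what a strictly ordered sequence
number rules out.

```python
def vc_compare(a, b):
    """Compare two vector clocks, given as dicts of node -> counter.

    Returns -1 if a happens-before b, 1 if b happens-before a,
    0 if they are equal, and None if they are concurrent (incomparable).
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return 0
    if a_le_b:
        return -1
    if b_le_a:
        return 1
    return None  # concurrent: no ordering exists between these updates

# Updates on two different partitions are concurrent:
# vc_compare({"n1": 1}, {"n2": 1}) -> None
```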

Database sequence numbers are used in replication and in determining whether
views need refreshing. Anything else I'm missing? Currently it seems there
is no tracking of which updates actually change a view index (comment on
line 588 of couch_httpd_view.erl on trunk). Improving this would be a nice
win. See my answer to number (3).

The easy way to manage seq numbers is to let one node be the write master
for any cluster. (The root node of any partition group could actually be a
cluster, but if writes always go through a master, the master can maintain
the global sequence number for the partition group.)
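As a sketch of that scheme (class and method names are hypothetical), the
master is the only process that hands out sequence numbers, so the
group-wide update sequence stays strictly ordered just as on a single node:

```python
import itertools

class WriteMaster:
    """Single write master for a partition group (illustrative sketch).

    All writes funnel through commit(), so the group-wide update
    sequence is strictly and totally ordered.
    """
    def __init__(self):
        self._counter = itertools.count(1)
        self.log = []  # (seq, doc_id) pairs, in commit order

    def commit(self, doc_id):
        seq = next(self._counter)
        self.log.append((seq, doc_id))
        return seq

master = WriteMaster()
master.commit("doc-a")  # seq 1
master.commit("doc-b")  # seq 2
```

The obvious trade-off, as noted above, is that this single counter is a
potential bottleneck and point of failure for the whole partition group.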

> One other thing that bothers me is the merge-sort required for every view
> lookup.  In *really* large clusters it won't be good if queries for a single
> key in a view have to hit each partition.  We could have an alternative
> structure where each view gets partitioned much like the document data while
> its built.  I worry that a view partitioned in this way may need frequent
> rebalancing during the build, since view keys are probably not going to be
> uniformly distributed.  Nevertheless, I think the benefit of having many
> view queries only hit a small subset of nodes in the cluster is pretty huge.

I agree that the merge-sort is something we need to look at carefully. We
should never hit a node in a view query unless it has data we need. We
certainly can't avoid merging altogether, but we can make an effort to do
smart rebalancing later on.
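The merge itself is a standard k-way merge over the sorted per-partition
result streams; a minimal Python sketch using the standard library's
heapq.merge (the partition contents here are made up for illustration):

```python
import heapq

def merge_view_results(partition_results):
    """Merge per-partition view rows, each stream already sorted by key."""
    return list(heapq.merge(*partition_results))

# Each partition returns rows sorted by view key:
p1 = [("apple", 1), ("cherry", 3)]
p2 = [("banana", 2), ("date", 4)]
merged = merge_view_results([p1, p2])
# merged == [("apple", 1), ("banana", 2), ("cherry", 3), ("date", 4)]
```

Because heapq.merge is lazy, a real implementation could stream rows to the
client without materializing every partition's full result set first.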

>> 3) Can we agree on a proposed solution to the layout of partition nodes? I
>> like the tree solution, as long as it is extremely flexible wrt tree depth.
> I'm not sure we're ready to do that.  In fact, I think we may need to
> implement a couple of different topologies and profile them to see what
> works best.  The tree topology is an interesting idea, but it may

That was a poorly posed question; I didn't expect these questions to have
easy answers. It should have read as a discussion prompt rather than a call
for consensus.

We should probably clarify the impetus for a tree structure. Computationally
intensive reduces are the primary use case, and the tree is a good way to
get a speedup here. In the case of a map-only view, we still need to merge and
aggregate the results from each shard. This merge needs to happen somewhere,
likely either at the node that's servicing the request or recursively up the
tree. In either case, we agree that there's not much win if every view
request has to hit every node. Therefore, I think we may need to start
tracking which updates affect the view index.
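The tree speedup for reduces can be sketched as recursive re-reduction: each
interior node reduces its children's partial results, so an associative
reduce function only ever combines a node's fan-in worth of values at a
time. This is a toy Python sketch, not CouchDB's actual reduce machinery;
the tree shape and the use of sum as the reduce function are assumptions.

```python
def tree_reduce(node, reduce_fn):
    """Recursively reduce partial results up a partition tree.

    node is either a leaf (a list of mapped values) or a dict of the
    form {"children": [...]} holding subtrees. reduce_fn must be
    associative so that re-reducing partial results is valid.
    """
    if isinstance(node, dict):
        partials = [tree_reduce(child, reduce_fn) for child in node["children"]]
        return reduce_fn(partials)  # re-reduce the children's partial results
    return reduce_fn(node)  # leaf: reduce the raw mapped values

tree = {"children": [[1, 2, 3], {"children": [[4, 5], [6]]}]}
total = tree_reduce(tree, sum)  # 21
```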

>> 4) Should the consistent hashing algorithm map ids to leaf nodes or just
>> to children? I lean toward children because it encapsulates knowledge
>> about the layout of subtrees at each tree level.
> If the algorithm maps to children, does that mean every document lookup has
> to traverse the tree?  I'm not sure that's a great idea.  Up to ~100 nodes I
> think it may be better to have all document lookups take O(1) hops.  I think
> distributed Erlang can keep a system of that size globally connected without
> too much trouble.

Yes, I think you're right. We should definitely optimize for O(1) reads for
document lookups and hash directly to leaf nodes.
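To make the leaf-node hashing concrete, here is a minimal consistent-hash
ring in Python; the node names, virtual-node count, and choice of MD5 are
all assumptions for illustration, not a committed design.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring: doc ids map directly to leaf nodes."""

    def __init__(self, nodes, replicas=64):
        # Each node gets several virtual points on the ring for balance.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, doc_id):
        """Find the owning leaf node: an O(log n) local lookup,
        then a single network hop to that node."""
        idx = bisect.bisect(self._keys, self._hash(doc_id)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["node1", "node2", "node3"])
owner = ring.node_for("some-doc-id")  # deterministic: same id, same node
```

The usual consistent-hashing property applies: adding or removing a leaf
node only remaps the keys adjacent to its points on the ring, rather than
rehashing every document.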

So, we need a consistent hash implementation. I will include this in the
proposal. From there, where should we go?

Thanks in advance,
