couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Kocoloski <>
Subject Re: CouchDB Cluster/Partition GSoC
Date Wed, 01 Apr 2009 15:37:02 GMT
On Apr 1, 2009, at 11:03 AM, Chris Anderson wrote:

>>>  2) What about _all_docs and seq-num?
>>> I presume _all_docs gets merged like any other view.   
>>> _all_docs_by_seq is a
>>> different story.  In the current code the sequence number is  
>>> incremented by
>>> one for every update.  If we want to preserve that behavior in  
>>> partitioned
>>> databases we need some sort of consensus algorithm or master  
>>> server to
>>> choose the next sequence number.  It could easily turn into a  
>>> bottleneck or
>>> single point-of-failure if we're not careful.
>>> The alternatives are to a) abandon the current format for update  
>>> sequences
>>> in favor of vector clocks or something more opaque, or b) have
>>> _all_docs_by_seq be strictly a node-local query.  I'd prefer the  
>>> former, as
>>> I think it will be useful for e.g. external indexers to treat the
>>> partitioned database just like a single server one.  If we do the  
>>> latter, I
>>> think it means that either the external indexers have to be  
>>> installed on
>>> every node, or at least they have to be aware of all the partitions.
>> If at all possible I think we should have the entire partition  
>> group appear
>> as a single server from the outside. One thing that comes to mind  
>> here is a
>> question about sequence numbers. Vector clocks only guarantee a  
>> partial
>> ordering, but I'm under the impression that currently seq numbers  
>> have a
>> strict ordering.
>> Database sequence numbers are used in replication and in  
>> determining whether
>> views need refreshing. Anything else I'm missing? Currently it  
>> seems there
>> is no tracking of which updates actually change a view index  
>> (comment on
>> line 588 of couch_httpd_view.erl on trunk). Improving this would be  
>> a nice
>> win. See my answer to number (3).
>> The easy way to manage seq numbers is to let one node be the write  
>> master
>> for any cluster. (The root node of any partition group could  
>> actually be a
>> cluster, but if writes always go through a master the master can  
>> maintain
>> the global sequence number for the partition group).
> The problem with this approach is that the main use-case for
> partitioning is when your incoming writes exceed the capacity of a
> single node. By partitioning the key-space, you can get more
> write-throughput.

I think Randall was saying requests just have to originate at the  
master node.  That master node could do nothing more than assign a  
sequence number, choose a node, and proxy the request down the tree  
for the heavy lifting.  I bet we could get pretty good throughput, but  
I still worry about this approach for availability reasons.

> I'm not sure that an update-seq per node is such a bad thing, as it
> will require any external indexers to be deployed in a 1-to-1
> relationship to nodes, which automatically balances the load for the
> indexer as well. With a merged seq-id, users would be encouraged to
> partition CouchDB without bothering to partition indexers. Maybe this
> is acceptable in some cases, but not in the general case.

So, the vector clock approach still has a per-node update sequence for  
each node's local clock, it just does the best job possible of  
globally ordering those per-node sequences.  We could easily offer  
local update sequences as well via some query string parameter.  I  
understand the desire to encourage partitioned indexers, but I believe  
that won't always be possible.  Bottom line, I think we should support  
global indexing of a partitioned DB.

>>> One other thing that bothers me is the merge-sort required for  
>>> every view
>>> lookup.  In *really* large clusters it won't be good if queries  
>>> for a single
>>> key in a view have to hit each partition.  We could have an  
>>> alternative
>>> structure where each view gets partitioned much like the document  
>>> data while
>>> its built.  I worry that a view partitioned in this way may need  
>>> frequent
>>> rebalancing during the build, since view keys are probably not  
>>> going to be
>>> uniformly distributed.  Nevertheless, I think the benefit of  
>>> having many
>>> view queries only hit a small subset of nodes in the cluster is  
>>> pretty huge.
>> I agree that the merge-sort is something we need to look at  
>> carefully. We
>> should never hit a node in a view query unless it has data we need.  
>> We
>> certainly can't avoid merging altogether, but we can make an effort  
>> to do
>> smart rebalancing later on.
> I think rebalancing aka shuffling will turn out to be one of those
> devil-in-the-details things. Because any document can emit any key, in
> the case of rebalancing, if you have to rebuild part of an index due
> to node-failure, you'd need to re-request from every other node, any
> view rows that might fit in that range. This requires every node to
> know about every other node.
> If view data is stored with document data, then nodes need only know
> about their child nodes, in the tree structure. Recovering from
> node-failure is easy: just swap in the failed node's hot-backup, and
> regenerate the views on it.
> I agree that the cost of merge sort will be ongoing, but I think the
> simplicity of this approach at least indicates that we should take it
> for the initial work. If we consider rebalancing an optimization, we
> can add it later.
> I think a better optimization would be to have inner nodes of the tree
> lazily cache the view rows of their children. This way the computation
> is spread out but the hops for popular queries can be mostly
> eliminated.

Ok, +1 from me.  I totally agree that rebalancing can get hairy, but  
hey, that's what makes this fun!

>>>  4) Should the consistent hashing algorithm map ids to leaf nodes  
>>> or just
>>>> to
>>>> children? I lean toward children because it encapsulates  
>>>> knowledge about
>>>> the
>>>> layout of subtrees at each tree level.
>>> If the algorithm maps to children, does that mean every document  
>>> lookup has
>>> to traverse the tree?  I'm not sure that's a great idea.  Up to  
>>> ~100 nodes I
>>> think it may be better to have all document lookups take O(1)  
>>> hops.  I think
>>> distributed Erlang can keep a system of that size globally  
>>> connected without
>>> too much trouble.
> I like the strict tree approach. I'd translate Adam's comment as:
> distributed Erlang can probably handle a tree of depth=1, even with
> ~100 nodes.

I'd like to hear more about how we implement redundancy and handle  
node failures in the tree structure.  In a pure consistent hashing  
ring, whether globally connected (Dynamo) or not (Chord), there are  
clear procedures for dealing with node failures, usually involving  
storing copies of the data at adjacent nodes along the ring.  Do we  
have an analogue of that in the tree?  I'm especially worried about  
what happens when inner nodes go down.

Best, Adam

View raw message