couchdb-dev mailing list archives

From: Adam Kocoloski <adam.kocolo...@gmail.com>
Subject: Re: CouchDB Cluster/Partition GSoC
Date: Wed, 01 Apr 2009 15:10:07 GMT
On Apr 1, 2009, at 1:59 AM, Randall Leeds wrote:

> On Sun, Mar 29, 2009 at 22:12, Adam Kocoloski <kocolosk@apache.org> wrote:
>
>> Hi Randall, cool!  I can chime in on a couple of the questions ...
>
> Adam, thanks for your quick reply and thorough comments. The more people
> chime in on this discussion, the better I can make the proposal, both in
> terms of likelihood for acceptance by a mentor/Google and the value of
> the resulting work for the community. I will aim to post a formalized
> draft of the proposal on my GSoC registration page sometime tomorrow and
> open it up for comments. Submission deadline is Friday.

Sounds like a plan to me.

>> 2) What about _all_docs and seq-num?
>>
>> I presume _all_docs gets merged like any other view.  _all_docs_by_seq
>> is a different story.  In the current code the sequence number is
>> incremented by one for every update.  If we want to preserve that
>> behavior in partitioned databases we need some sort of consensus
>> algorithm or master server to choose the next sequence number.  It
>> could easily turn into a bottleneck or single point of failure if
>> we're not careful.
>>
>> The alternatives are to a) abandon the current format for update
>> sequences in favor of vector clocks or something more opaque, or b)
>> have _all_docs_by_seq be strictly a node-local query.  I'd prefer the
>> former, as I think it will be useful for e.g. external indexers to
>> treat the partitioned database just like a single-server one.  If we
>> do the latter, I think it means that either the external indexers have
>> to be installed on every node, or at least they have to be aware of
>> all the partitions.
>
> If at all possible I think we should have the entire partition group
> appear as a single server from the outside. One thing that comes to
> mind here is a question about sequence numbers. Vector clocks only
> guarantee a partial ordering, but I'm under the impression that
> currently seq numbers have a strict ordering.

Yes, that's a good point.  Vector clocks may not be sufficient here.
On the other hand, do we absolutely need a strict ordering of events?
If the purpose of these sequence numbers is to ensure that replicators
and indexers don't miss any updates, can't we just interpret GET
_all_docs_by_seq as "give me all the updates that *might* have happened
after X"?  That's a request we can answer with vector clocks: it's just
the set of all updates X' such that VC(X') >= VC(X).  Of course, it's
less efficient in that we may send duplicate updates in a write-heavy
scenario.

Caveat: I haven't given much thought to how we'd efficiently store old  
versions of the vector clock at all nodes.
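
To make the comparison concrete, here's a rough sketch (illustrative
Erlang, treating a vector clock as a map of node id -> update count;
none of this is actual CouchDB code):

-module(vc_sketch).
-export([updates_since/2]).

%% true when VCa >= VCb componentwise, i.e. VCa has seen at least
%% everything VCb has.
dominates(VCa, VCb) ->
    lists:all(fun({Node, Count}) -> maps:get(Node, VCa, 0) >= Count end,
              maps:to_list(VCb)).

%% "Give me all the updates that *might* have happened after X":
%% return every update whose clock is >= VC(X).  Clocks equal to VC(X)
%% qualify too, which is one place the duplicate sends come from.
updates_since(Updates, VCx) ->
    [Update || {_DocId, VCu} = Update <- Updates, dominates(VCu, VCx)].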

> Database sequence numbers are used in replication and in determining
> whether views need refreshing. Anything else I'm missing?

Any external indexers (couchdb-lucene, for instance) also need the  
sequence numbers.

> Currently it seems there is no tracking of which updates actually
> change a view index (comment on line 588 of couch_httpd_view.erl on
> trunk). Improving this would be a nice win. See my answer to number (3).

That's only partially true.  You're right that the Etags aren't yet  
smart enough to know when a view stayed the same.  However, we  
definitely do track relationships between documents and view keys in a  
separate btree -- we have to if we want to expire the old view rows  
when a document is updated.  I think we should eventually be able to  
leverage this information to be smarter about view Etags.
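
As a simplified illustration of that back-index idea (hypothetical code,
not the actual couch_view btree handling): keep a map from DocId to the
view keys it last emitted, and consult it on update to expire the stale
rows before inserting the new ones.

-module(view_backindex_sketch).
-export([update_view/3]).

%% ViewByKey maps {Key, DocId} -> Value; KeysByDoc is the back-index.
update_view(DocId, NewKVs, {ViewByKey, KeysByDoc}) ->
    OldKeys = maps:get(DocId, KeysByDoc, []),
    %% expire the rows this document emitted last time ...
    Purged = maps:without([{K, DocId} || K <- OldKeys], ViewByKey),
    %% ... then insert the freshly emitted rows
    Updated = maps:merge(Purged,
                         maps:from_list([{{K, DocId}, V} || {K, V} <- NewKVs])),
    {Updated, KeysByDoc#{DocId => [K || {K, _V} <- NewKVs]}}.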

> The easy way to manage seq numbers is to let one node be the write
> master for any cluster. (The root node of any partition group could
> actually be a cluster, but if writes always go through a master the
> master can maintain the global sequence number for the partition
> group).

Yeah, this scares me a little bit.  I assume by a write master you  
mean a node that's responsible for ordering all of the updates to a  
database, regardless of where those documents are actually stored on  
disk.  I'm sure we can build a performant implementation (it's just a  
counter, after all), but I worry about the availability of such a  
system.  I guess that's what supervision trees are for ... but I'd  
prefer to try to solve these problems in a decentralized manner if  
possible.  My $.02.
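
Just to spell out the "it's just a counter" part (a toy sketch, not a
proposal for the real implementation): a single registered process that
hands out the next update sequence for the whole partition group.

-module(seq_master).
-behaviour(gen_server).
-export([start_link/0, next_seq/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% One process orders every update in the partition group by handing
%% out a monotonically increasing sequence number.
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, 0, []).

next_seq() ->
    gen_server:call(?MODULE, next_seq).

init(Seq) -> {ok, Seq}.

handle_call(next_seq, _From, Seq) ->
    {reply, Seq + 1, Seq + 1}.

handle_cast(_Msg, Seq) ->
    {noreply, Seq}.

The counter itself is trivial; the availability question above is the
real issue -- if this process (or its node) goes down, nothing in the
group can be sequenced until a supervisor restarts it or another node
takes over.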

>>> 3) Can we agree on a proposed solution to the layout of partition
>>> nodes? I like the tree solution, as long as it is extremely flexible
>>> wrt tree depth.
>>
>> I'm not sure we're ready to do that.  In fact, I think we may need to
>> implement a couple of different topologies and profile them to see
>> what works best.  The tree topology is an interesting idea, but it may
>
> That was a silly question. I didn't expect these questions to be easy.
> That should have read as a discussion prompt rather than a call for
> consensus.
>
> We should probably clarify the impetus for a tree structure.
> Computationally intensive reduces are the primary use case, and the
> tree is a good way to get speedup here. In the case of a map-only
> view, we still need to merge and aggregate the results from each
> shard. This merge needs to happen somewhere, likely either at the node
> that's servicing the request or recursively up the tree. In either
> case, we agree that there's not much win if every view request has to
> hit every node. Therefore, I think we may need to start tracking which
> updates affect the view index.

Good point -- caching the map-only views from leaf nodes could be a
nice win for the tree structure. It hadn't clicked for me until just
now.

Best,

Adam

> So, we need a consistent hash implementation. I will include this in
> the proposal.
> From there, where should we go?
>
> Thanks in advance,
> Randall

