From: Chris Anderson <jchris@gmail.com>
To: dev@couchdb.apache.org
Date: Wed, 1 Apr 2009 08:58:55 -0700
Subject: Re: CouchDB Cluster/Partition GSoC

On Wed, Apr 1, 2009 at 8:37 AM, Adam Kocoloski wrote:
> On Apr 1, 2009, at 11:03 AM, Chris Anderson wrote:
>
>>>> 2) What about _all_docs and seq-num?
>>>>
>>>> I presume _all_docs gets merged like any other view. _all_docs_by_seq
>>>> is a different story. In the current code the sequence number is
>>>> incremented by one for every update. If we want to preserve that
>>>> behavior in partitioned databases we need some sort of consensus
>>>> algorithm or master server to choose the next sequence number. It
>>>> could easily turn into a bottleneck or single point-of-failure if
>>>> we're not careful.
>>>>
>>>> The alternatives are to a) abandon the current format for update
>>>> sequences in favor of vector clocks or something more opaque, or b)
>>>> have _all_docs_by_seq be strictly a node-local query. I'd prefer the
>>>> former, as I think it will be useful for e.g. external indexers to
>>>> treat the partitioned database just like a single server one. If we
>>>> do the latter, I think it means that either the external indexers
>>>> have to be installed on every node, or at least they have to be aware
>>>> of all the partitions.
>>>
>>> If at all possible I think we should have the entire partition group
>>> appear as a single server from the outside. One thing that comes to
>>> mind here is a question about sequence numbers. Vector clocks only
>>> guarantee a partial ordering, but I'm under the impression that
>>> currently seq numbers have a strict ordering.
>>>
>>> Database sequence numbers are used in replication and in determining
>>> whether views need refreshing. Anything else I'm missing? Currently it
>>> seems there is no tracking of which updates actually change a view
>>> index (comment on line 588 of couch_httpd_view.erl on trunk).
>>> Improving this would be a nice win. See my answer to number (3).
>>>
>>> The easy way to manage seq numbers is to let one node be the write
>>> master for any cluster. (The root node of any partition group could
>>> actually be a cluster, but if writes always go through a master the
>>> master can maintain the global sequence number for the partition
>>> group.)
>>
>> The problem with this approach is that the main use-case for
>> partitioning is when your incoming writes exceed the capacity of a
>> single node. By partitioning the key-space, you can get more
>> write-throughput.
>
> I think Randall was saying requests just have to originate at the master
> node. That master node could do nothing more than assign a sequence
> number, choose a node, and proxy the request down the tree for the heavy
> lifting. I bet we could get pretty good throughput, but I still worry
> about this approach for availability reasons.

Yes, I agree. I think vector clocks are a good compromise. I hadn't
considered that since index updates are idempotent, we can allow a little
slop in the global clock. This makes everything much easier.

>> I'm not sure that an update-seq per node is such a bad thing, as it
>> will require any external indexers to be deployed in a 1-to-1
>> relationship to nodes, which automatically balances the load for the
>> indexer as well. With a merged seq-id, users would be encouraged to
>> partition CouchDB without bothering to partition indexers. Maybe this
>> is acceptable in some cases, but not in the general case.
>
> So, the vector clock approach still has a per-node update sequence for
> each node's local clock, it just does the best job possible of globally
> ordering those per-node sequences. We could easily offer local update
> sequences as well via some query string parameter. I understand the
> desire to encourage partitioned indexers, but I believe that won't
> always be possible. Bottom line, I think we should support global
> indexing of a partitioned DB.

I think you're right. As long as partitioned indexers are possible, I
have nothing against making global indexers as well.
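To make the vector clock idea concrete, here's a rough sketch of what a
partition-group update sequence might look like. This is illustrative
Erlang only; the module and function names are invented and none of it is
existing CouchDB code:

%% Rough sketch only: a partition-group "update sequence" as a vector
%% clock. Module and function names are hypothetical, not CouchDB
%% internals.
-module(group_seq).
-export([bump/2, merge/2, descends/2]).

%% A clock is a list of {NodeName, LocalUpdateSeq} pairs.

%% A node applies a local update: bump its own counter.
bump(Node, Clock) ->
    Seq = proplists:get_value(Node, Clock, 0),
    lists:keystore(Node, 1, Clock, {Node, Seq + 1}).

%% Merge two clocks (e.g. an indexer combining checkpoints from several
%% partitions): take the pairwise maximum.
merge(ClockA, ClockB) ->
    Nodes = lists:usort([N || {N, _} <- ClockA ++ ClockB]),
    [{N, max(proplists:get_value(N, ClockA, 0),
             proplists:get_value(N, ClockB, 0))} || N <- Nodes].

%% Partial order: ClockA descends from ClockB if it has seen everything
%% ClockB has. Concurrent clocks compare false both ways; that's the
%% "slop" idempotent index updates let us tolerate.
descends(ClockA, ClockB) ->
    lists:all(fun({N, S}) -> proplists:get_value(N, ClockA, 0) >= S end,
              ClockB).

A global indexer could checkpoint against the merged clock and use
descends/2 to ask whether a node has caught up to its checkpoint; the
node-local sequence exposed via a query string parameter would just be
that node's own entry in the clock.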
> I'd like to hear more about how we implement redundancy and handle node
> failures in the tree structure. In a pure consistent hashing ring,
> whether globally connected (Dynamo) or not (Chord), there are clear
> procedures for dealing with node failures, usually involving storing
> copies of the data at adjacent nodes along the ring. Do we have an
> analogue of that in the tree? I'm especially worried about what happens
> when inner nodes go down.

I like to think of partitioning and redundancy as orthogonal. If each
node has a hot-failover "twin", then parent nodes can track, for each of
their children, the child's twin as well, and swap the twin in when a
child becomes unavailable.

I'm not so hot on the Chord / Dynamo style of storing parts of partitions
on other partitions. Even just saying that is confusing. Because physical
nodes need not map directly to logical nodes, we just need to be sure
that each node's twin lives on a different physical node (which it can
share with other logical nodes).

The end result is that we can have N duplicates of the entire tree, and
can even load-balance across them. It'd be a small optimization to allow
you to read from both twins and write to just one.

Chris

--
Chris Anderson
http://jchrisa.net
http://couch.io
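For concreteness, a minimal sketch of the twin failover routing described
above, again in illustrative Erlang; the module name, record layout, and
the use of net_adm:ping as a liveness check are all assumptions rather
than an actual design:

%% Illustrative only: a parent routing a request to a logical child or to
%% its hot-failover twin. Names and record layout are hypothetical.
-module(twin_route).
-export([pick/1, route/2]).

%% Each logical child is tracked together with its twin, which must live
%% on a different physical node.
-record(child, {primary, twin}).

%% Prefer the primary; fall back to the twin if the primary is down.
pick(#child{primary = P, twin = T}) ->
    case net_adm:ping(P) of
        pong -> {ok, P};
        pang ->
            case net_adm:ping(T) of
                pong -> {ok, T};
                pang -> {error, both_down}
            end
    end.

%% Proxy a request (an arbitrary fun) to whichever copy is reachable.
%% Writes would go to one copy; reads could load-balance across both.
route(Child, Fun) when is_function(Fun, 1) ->
    case pick(Child) of
        {ok, Node}      -> rpc:call(Node, erlang, apply, [Fun, [Node]]);
        {error, Reason} -> {error, Reason}
    end.

With N twins per logical node instead of one, the same routing gives N
duplicates of the entire tree to load-balance reads across.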