From: Alexander Shraer <shralex@gmail.com>
Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
Date: Thu, 15 Mar 2012 20:43:33 -0700
To: Alexander Shraer
Cc: christian.ziech@nokia.com, user@zookeeper.apache.org

Actually it's still not clear to me how you would enforce the 2x+1. In Zookeeper we can guarantee liveness (progress) only when x+1 servers are connected and up; safety (correctness), however, is always guaranteed, even if 2 out of 3 servers are temporarily down. Your design needs the 2x+1 for safety, which I think is problematic unless you can accurately detect failures (synchrony) and failures are permanent.

Alex

On Mar 15, 2012, at 3:54 PM, Alexander Shraer wrote:

> I think the concern is that the old VM can recover and try to
> reconnect. Theoretically you could even go back and forth between new
> and old VM.
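A minimal numeric sketch of the x+1 out of 2x+1 bound above (the ensemble sizes used are illustrative assumptions, not taken from this thread):

    // With n = 2x + 1 servers, ZooKeeper makes progress (liveness) only while
    // a quorum of x + 1 = n/2 + 1 servers is up and connected; safety does not
    // depend on how many servers are currently up.
    public class QuorumMath {

        static int quorum(int ensembleSize) {
            return ensembleSize / 2 + 1; // x + 1 for an ensemble of 2x + 1
        }

        public static void main(String[] args) {
            for (int n : new int[] {3, 5, 7}) {
                System.out.println(n + " servers: quorum of " + quorum(n)
                        + ", liveness tolerates " + (n - quorum(n)) + " down");
            }
        }
    }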
> For example, suppose that you have servers A, B and C in the cluster, and A is
> the leader. C is slow and "replaced" with C', then update U is acked by A and
> C', then A fails. In this situation you cannot have additional failures. But
> with the automatic replacement thing it can (theoretically) happen that C'
> becomes a little slow, C connects to B and is chosen as leader, and the
> committed update U is lost forever. This is perhaps unlikely but possible...
>
> Alex
>
> On Thu, Mar 15, 2012 at 1:35 PM, christian.ziech@nokia.com wrote:
>> I agree with your points about any kind of VM having hard-to-predict runtime
>> behaviour and that participants of the zookeeper ensemble running on a VM
>> could fail to send keep-alives for an uncertain amount of time. But I don't
>> yet understand how that would break the approach I was mentioning: just
>> trying to re-resolve the InetAddress after an IO exception should in that
>> case still lead to the same original IP address (and eventually to that node
>> rejoining the ensemble).
>> Only if that host name (which the old node was using) were re-assigned to
>> another instance would this re-resolving step point to a new IP (and hence
>> cause the old server to be replaced).
>>
>> Did I understand your objection correctly?
>>
>> ________________________________________
>> From: ext Ted Dunning [ted.dunning@gmail.com]
>> Sent: Thursday, 15 March 2012 19:50
>> To: user@zookeeper.apache.org
>> Cc: shralex@gmail.com
>> Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
>>
>> Alexander's comment still applies.
>>
>> VMs can function or go away completely, but they can also malfunction in
>> more subtle ways such that they just go VEEEERRRRY slowly. You have to
>> account for that failure mode. These failures can even be transient.
>>
>> This would probably break your approach.
>>
>> On 3/15/12, christian.ziech@nokia.com wrote:
>>> Oh sorry, there is a slight misunderstanding. With VM I did not mean the
>>> Java VM but the Linux VM that contains the zookeeper node. We get notified
>>> if that goes away and is repurposed.
>>>
>>> BR
>>> Christian
>>>
>>> Sent from my Nokia Lumia 800
>>> ________________________________
>>> From: ext Alexander Shraer
>>> Sent: 15.03.2012 16:33
>>> To: user@zookeeper.apache.org; Ziech Christian (Nokia-LC/Berlin)
>>> Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
>>>
>>> Yes, by replacing x at a time from 2x+1 you have quorum intersection.
>>>
>>> I have one more question - zookeeper itself doesn't assume perfect failure
>>> detection, which your scheme requires. What if the VM didn't actually fail
>>> but is just slow and then tries to reconnect?
>>>
>>> On Thu, Mar 15, 2012 at 2:50 AM, Christian Ziech wrote:
>>>> I don't think that we could be running into a split-brain problem in our
>>>> use case.
>>>> Let me try to describe the scenario we are worried about (assuming an
>>>> ensemble of 5 nodes A, B, C, D, E):
>>>> - The ensemble is up and running and in sync
>>>> - Node A with the host name "zookeeperA.whatever-domain.priv" goes down
>>>> because the VM has gone away
>>>> - That removal of the VM is detected and a new VM is spawned with the same
>>>> host name "zookeeperA.whatever-domain.priv" - let's call that node A'
>>>> - The zookeeper on node A' wants to join the cluster - right now this gets
>>>> rejected by the others since A' has a different IP address than A (and the
>>>> old one is "cached" in the InetSocketAddress of the QuorumPeer instance)
>>>>
>>>> We could ensure that at any given time there is at most one node with the
>>>> host name "zookeeperA.whatever-domain.priv" known by the ensemble and that
>>>> once a node is replaced, it does not come back. Also we could make sure
>>>> that our ensemble is big enough to compensate for a replacement of up to x
>>>> nodes at a time (setting it to 2x + 1 nodes).
>>>>
>>>> So if I did not misestimate our problem, it should be (due to these
>>>> restrictions) simpler than the problem to be solved by ZOOKEEPER-107. My
>>>> intention in solving this smaller, discrete problem is basically not to
>>>> have to wait for ZOOKEEPER-107 to make it into a release (the assumption
>>>> being that a smaller fix might even have a chance to make it into the
>>>> 3.4.x branch).
>>>>
>>>> On 15.03.2012 07:46, ext Alexander Shraer wrote:
>>>>>
>>>>> Hi Christian,
>>>>>
>>>>> ZK-107 would indeed allow you to add/remove servers and change their
>>>>> addresses.
>>>>>
>>>>>> We could ensure that we always have a more or less fixed quorum of
>>>>>> zookeeper servers with a fixed set of host names.
>>>>>
>>>>> You should probably also ensure that a majority of the old ensemble
>>>>> intersects with a majority of the new one. Otherwise you have to run a
>>>>> reconfiguration protocol similar to ZK-107. For example, if you have 3
>>>>> servers A, B and C, and now you're adding D and E to replace B and C, how
>>>>> would this work? It is probable that D and E don't have the latest state
>>>>> (as you mention) and A is down or doesn't have the latest state either (a
>>>>> minority might not have the latest state). Also, how do you prevent split
>>>>> brain in this case, meaning B and C thinking that they are still
>>>>> operational? Perhaps I'm missing something, but I suspect that the change
>>>>> you propose won't be enough...
>>>>>
>>>>> Best Regards,
>>>>> Alex
>>>>>
>>>>>
>>>>> On Wed, Mar 14, 2012 at 10:01 AM, Christian Ziech wrote:
>>>>>
>>>>> Just a small addition: in my opinion the patch could really boil down to
>>>>> adding a
>>>>>
>>>>>     quorumServer.electionAddr = new
>>>>>         InetSocketAddress(electionAddr.getHostName(),
>>>>>                           electionAddr.getPort());
>>>>>
>>>>> in the catch (IOException e) clause of the connectOne() method of the
>>>>> QuorumCnxManager. In addition one should perhaps make the electionAddr
>>>>> field in the QuorumPeer.QuorumServer class volatile to prevent races.
>>>>>
>>>>> I haven't fully checked this change for implications yet, but a quick
>>>>> test on some machines at least showed it would solve our use case. What
>>>>> do the more expert users / maintainers think - is it even worthwhile to
>>>>> go that route?
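A rough, self-contained sketch of the re-resolution step proposed above; only the InetSocketAddress handling mirrors the suggestion, while the surrounding names (ElectionAddrReResolver, connectOrReResolve, the timeout) are placeholders and not the real QuorumCnxManager code:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.SocketChannel;

    final class ElectionAddrReResolver {

        // Builds a fresh InetSocketAddress from the host name of the cached
        // one, forcing a new DNS lookup; the result would be assigned back to
        // the (ideally volatile) electionAddr field of QuorumPeer.QuorumServer.
        static InetSocketAddress reResolve(InetSocketAddress cached) {
            return new InetSocketAddress(cached.getHostName(), cached.getPort());
        }

        // Placeholder showing where the call would sit: when the connection
        // attempt fails with an IOException, re-resolve before the next retry.
        static InetSocketAddress connectOrReResolve(InetSocketAddress addr) {
            try (SocketChannel channel = SocketChannel.open()) {
                channel.socket().connect(addr, 5000); // 5s timeout, arbitrary
                return addr; // reachable; keep the cached resolution
            } catch (IOException e) {
                return reResolve(addr);
            }
        }
    }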
>>>>> On 14.03.2012 17:04, ext Christian Ziech wrote:
>>>>>
>>>>> Let me describe our upcoming use case in a few words: we are planning to
>>>>> use zookeeper in a cloud where typically nodes come and go unpredictably.
>>>>> We could ensure that we always have a more or less fixed quorum of
>>>>> zookeeper servers with a fixed set of host names. However, the IPs
>>>>> associated with the host names would change every time a new server
>>>>> comes up. I browsed the code a little and it seems right now that the
>>>>> only problem is that the zookeeper server is remembering the resolved
>>>>> InetSocketAddress in its QuorumPeer hash map.
>>>>>
>>>>> I saw that ZOOKEEPER-107 would possibly also solve that problem, but
>>>>> possibly in a more generic way than actually needed (perhaps here I
>>>>> underestimate the impact of joining as a server with an empty data
>>>>> directory to replace a server that previously had one).
>>>>>
>>>>> Given that - from looking at ZOOKEEPER-107 - it seems that it will still
>>>>> take some time for the proposed fix to make it into a release, would it
>>>>> make sense to invest time into a smaller fix just for this "replacing a
>>>>> dropped server without rolling restarts" use case? Would there be a
>>>>> chance that a fix for this makes it into the 3.4.x branch?
>>>>>
>>>>> Are there perhaps other ways to get this use case supported without the
>>>>> need for doing rolling restarts whenever we need to replace one of the
>>>>> zookeeper servers?
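For reference, a small self-contained demonstration of the resolution behaviour described above (the host name is the illustrative one from the thread and 3888 is the usual election port; whether the second lookup actually sees a new IP also depends on the JVM's own DNS cache, networkaddress.cache.ttl):

    import java.net.InetSocketAddress;

    public class StaleAddressDemo {
        public static void main(String[] args) {
            // Resolves once, at construction time, and keeps that IP for the
            // lifetime of the object - this is the "remembered" address.
            InetSocketAddress cached =
                    new InetSocketAddress("zookeeperA.whatever-domain.priv", 3888);

            // ... suppose the VM behind that name is replaced here and DNS now
            // points at a different IP ...

            // Only constructing a new InetSocketAddress from the host name
            // triggers a fresh lookup and can pick up the replacement VM.
            InetSocketAddress reResolved =
                    new InetSocketAddress(cached.getHostName(), cached.getPort());

            System.out.println("cached:      " + cached.getAddress());
            System.out.println("re-resolved: " + reResolved.getAddress());
        }
    }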