From: Alexander Shraer <shralex@gmail.com>
Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
Date: Thu, 15 Mar 2012 20:43:33 -0700
To: Alexander Shraer
Cc: christian.ziech@nokia.com, user@zookeeper.apache.org

Actually it's still not clear to me how you would enforce the 2x+1. In Zookeeper we can guarantee liveness (progress) only when x+1 servers are connected and up; safety (correctness), however, is always guaranteed, even if 2 out of 3 servers are temporarily down. Your design needs the 2x+1 for safety, which I think is problematic unless you can accurately detect failures (synchrony) and failures are permanent.

Alex

On Mar 15, 2012, at 3:54 PM, Alexander Shraer wrote:

> I think the concern is that the old VM can recover and try to
> reconnect. Theoretically you could even go back and forth between new
> and old VM.
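A minimal numeric sketch of the x+1 out of 2x+1 bound above (the ensemble sizes used are illustrative assumptions, not taken from this thread):

    // With n = 2x + 1 servers, ZooKeeper makes progress (liveness) only while
    // a quorum of x + 1 = n/2 + 1 servers is up and connected; safety does not
    // depend on how many servers are currently up.
    public class QuorumMath {

        static int quorum(int ensembleSize) {
            return ensembleSize / 2 + 1; // x + 1 for an ensemble of 2x + 1
        }

        public static void main(String[] args) {
            for (int n : new int[] {3, 5, 7}) {
                System.out.println(n + " servers: quorum of " + quorum(n)
                        + ", liveness tolerates " + (n - quorum(n)) + " down");
            }
        }
    }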
> For example, suppose that you have servers A, B and C in the cluster, and A is
> the leader. C is slow and "replaced" with C', then update U is acked by A and
> C', then A fails. In this situation you cannot have additional failures. But
> with the automatic replacement thing it can (theoretically) happen that C'
> becomes a little slow, C connects to B and is chosen as leader, and the
> committed update U is lost forever. This is perhaps unlikely but possible...
>
> Alex
>
> On Thu, Mar 15, 2012 at 1:35 PM, christian.ziech@nokia.com wrote:
>> I agree with your points about any kind of VM having hard-to-predict runtime
>> behaviour and that participants of the zookeeper ensemble running on a VM
>> could fail to send keep-alives for an uncertain amount of time. But I don't
>> yet understand how that would break the approach I was mentioning: just
>> trying to re-resolve the InetAddress after an IO exception should in that
>> case still lead to the same original IP address (and eventually to that node
>> rejoining the ensemble).
>> Only if that host name (which the old node was using) were re-assigned to
>> another instance would this re-resolving step point to a new IP (and hence
>> cause the old server to be replaced).
>>
>> Did I understand your objection correctly?
>>
>> ________________________________________
>> From: ext Ted Dunning [ted.dunning@gmail.com]
>> Sent: Thursday, 15 March 2012 19:50
>> To: user@zookeeper.apache.org
>> Cc: shralex@gmail.com
>> Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
>>
>> Alexander's comment still applies.
>>
>> VMs can function or go away completely, but they can also malfunction in
>> more subtle ways such that they just go VEEEERRRRY slowly. You have to
>> account for that failure mode. These failures can even be transient.
>>
>> This would probably break your approach.
>>
>> On 3/15/12, christian.ziech@nokia.com wrote:
>>> Oh sorry, there is a slight misunderstanding. With VM I did not mean the
>>> Java VM but the Linux VM that contains the zookeeper node. We get notified
>>> if that goes away and is repurposed.
>>>
>>> BR
>>> Christian
>>>
>>> Sent from my Nokia Lumia 800
>>> ________________________________
>>> From: ext Alexander Shraer
>>> Sent: 15.03.2012 16:33
>>> To: user@zookeeper.apache.org; Ziech Christian (Nokia-LC/Berlin)
>>> Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
>>>
>>> Yes, by replacing x at a time from 2x+1 you have quorum intersection.
>>>
>>> I have one more question - zookeeper itself doesn't assume perfect failure
>>> detection, which your scheme requires. What if the VM didn't actually fail
>>> but is just slow and then tries to reconnect?
>>>
>>> On Thu, Mar 15, 2012 at 2:50 AM, Christian Ziech wrote:
>>>> I don't think that we could be running into a split-brain problem in our
>>>> use case.
>>>> Let me try to describe the scenario we are worried about (assuming an
>>>> ensemble of 5 nodes A, B, C, D, E):
>>>> - The ensemble is up and running and in sync
>>>> - Node A with the host name "zookeeperA.whatever-domain.priv" goes down
>>>> because the VM has gone away
>>>> - That removal of the VM is detected and a new VM is spawned with the same
>>>> host name "zookeeperA.whatever-domain.priv" - let's call that node A'
>>>> - The zookeeper on node A' wants to join the cluster - right now this gets
>>>> rejected by the others since A' has a different IP address than A (and the
>>>> old one is "cached" in the InetSocketAddress of the QuorumPeer instance)
>>>>
>>>> We could ensure that at any given time there is at most one node with the
>>>> host name "zookeeperA.whatever-domain.priv" known by the ensemble and that
>>>> once a node is replaced, it does not come back. Also we could make sure
>>>> that our ensemble is big enough to compensate for a replacement of up to x
>>>> nodes at a time (setting it to 2x + 1 nodes).
>>>>
>>>> So if I did not misestimate our problem, it should be (due to these
>>>> restrictions) simpler than the problem to be solved by ZOOKEEPER-107. My
>>>> intention in solving this smaller, discrete problem is basically not to
>>>> have to wait for ZOOKEEPER-107 to make it into a release (the assumption
>>>> being that a smaller fix might even have a chance to make it into the
>>>> 3.4.x branch).
>>>>
>>>> On 15.03.2012 07:46, ext Alexander Shraer wrote:
>>>>>
>>>>> Hi Christian,
>>>>>
>>>>> ZK-107 would indeed allow you to add/remove servers and change their
>>>>> addresses.
>>>>>
>>>>>> We could ensure that we always have a more or less fixed quorum of
>>>>>> zookeeper servers with a fixed set of host names.
>>>>>
>>>>> You should probably also ensure that a majority of the old ensemble
>>>>> intersects with a majority of the new one. Otherwise you have to run a
>>>>> reconfiguration protocol similar to ZK-107. For example, if you have 3
>>>>> servers A, B and C, and now you're adding D and E to replace B and C, how
>>>>> would this work? It is probable that D and E don't have the latest state
>>>>> (as you mention) and A is down or doesn't have the latest state either (a
>>>>> minority might not have the latest state). Also, how do you prevent split
>>>>> brain in this case, meaning B and C thinking that they are still
>>>>> operational? Perhaps I'm missing something, but I suspect that the change
>>>>> you propose won't be enough...
>>>>>
>>>>> Best Regards,
>>>>> Alex
>>>>>
>>>>>
>>>>> On Wed, Mar 14, 2012 at 10:01 AM, Christian Ziech wrote:
>>>>>
>>>>> Just a small addition: in my opinion the patch could really boil down to
>>>>> adding a
>>>>>
>>>>>     quorumServer.electionAddr = new
>>>>>         InetSocketAddress(electionAddr.getHostName(),
>>>>>                           electionAddr.getPort());
>>>>>
>>>>> in the catch (IOException e) clause of the connectOne() method of the
>>>>> QuorumCnxManager. In addition one should perhaps make the electionAddr
>>>>> field in the QuorumPeer.QuorumServer class volatile to prevent races.
>>>>>
>>>>> I haven't fully checked this change for implications yet, but a quick
>>>>> test on some machines at least showed it would solve our use case. What
>>>>> do the more expert users / maintainers think - is it even worthwhile to
>>>>> go that route?
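A rough, self-contained sketch of the re-resolution step proposed above; only the InetSocketAddress handling mirrors the suggestion, while the surrounding names (ElectionAddrReResolver, connectOrReResolve, the timeout) are placeholders and not the real QuorumCnxManager code:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.SocketChannel;

    final class ElectionAddrReResolver {

        // Builds a fresh InetSocketAddress from the host name of the cached
        // one, forcing a new DNS lookup; the result would be assigned back to
        // the (ideally volatile) electionAddr field of QuorumPeer.QuorumServer.
        static InetSocketAddress reResolve(InetSocketAddress cached) {
            return new InetSocketAddress(cached.getHostName(), cached.getPort());
        }

        // Placeholder showing where the call would sit: when the connection
        // attempt fails with an IOException, re-resolve before the next retry.
        static InetSocketAddress connectOrReResolve(InetSocketAddress addr) {
            try (SocketChannel channel = SocketChannel.open()) {
                channel.socket().connect(addr, 5000); // 5s timeout, arbitrary
                return addr; // reachable; keep the cached resolution
            } catch (IOException e) {
                return reResolve(addr);
            }
        }
    }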
>>>>> On 14.03.2012 17:04, ext Christian Ziech wrote:
>>>>>
>>>>> Let me describe our upcoming use case in a few words: we are planning to
>>>>> use zookeeper in a cloud where typically nodes come and go unpredictably.
>>>>> We could ensure that we always have a more or less fixed quorum of
>>>>> zookeeper servers with a fixed set of host names. However, the IPs
>>>>> associated with the host names would change every time a new server
>>>>> comes up. I browsed the code a little and it seems right now that the
>>>>> only problem is that the zookeeper server is remembering the resolved
>>>>> InetSocketAddress in its QuorumPeer hash map.
>>>>>
>>>>> I saw that ZOOKEEPER-107 would possibly also solve that problem, but
>>>>> possibly in a more generic way than actually needed (perhaps here I
>>>>> underestimate the impact of joining as a server with an empty data
>>>>> directory to replace a server that previously had one).
>>>>>
>>>>> Given that - from looking at ZOOKEEPER-107 - it seems that it will still
>>>>> take some time for the proposed fix to make it into a release, would it
>>>>> make sense to invest time into a smaller fix just for this "replacing a
>>>>> dropped server without rolling restarts" use case? Would there be a
>>>>> chance that a fix for this makes it into the 3.4.x branch?
>>>>>
>>>>> Are there perhaps other ways to get this use case supported without the
>>>>> need for doing rolling restarts whenever we need to replace one of the
>>>>> zookeeper servers?
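For reference, a small self-contained demonstration of the resolution behaviour described above (the host name is the illustrative one from the thread and 3888 is the usual election port; whether the second lookup actually sees a new IP also depends on the JVM's own DNS cache, networkaddress.cache.ttl):

    import java.net.InetSocketAddress;

    public class StaleAddressDemo {
        public static void main(String[] args) {
            // Resolves once, at construction time, and keeps that IP for the
            // lifetime of the object - this is the "remembered" address.
            InetSocketAddress cached =
                    new InetSocketAddress("zookeeperA.whatever-domain.priv", 3888);

            // ... suppose the VM behind that name is replaced here and DNS now
            // points at a different IP ...

            // Only constructing a new InetSocketAddress from the host name
            // triggers a fresh lookup and can pick up the replacement VM.
            InetSocketAddress reResolved =
                    new InetSocketAddress(cached.getHostName(), cached.getPort());

            System.out.println("cached:      " + cached.getAddress());
            System.out.println("re-resolved: " + reResolved.getAddress());
        }
    }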