From: Alexander Shraer
Date: Fri, 16 Mar 2012 11:37:37 -0700
Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
To: Christian Ziech
Cc: user@zookeeper.apache.org

I think this is why, when you're doing rolling restarts / reconfiguration, you should never have two different servers with the same id that have any chance of being up at the same time. With 107 you'd have to remove the server and add a new server with some different id (choosing the new id is left to the user).

In terms of support, with 107 we need all the help we can get :) There are currently two parts of it in pretty good shape that I'm hoping to integrate soon: 1355 and 1411. Comments on, or testing of, 1411 would be very helpful at this point. Also, if you wish, you can check out the latest patch for 107. That patch is not going to be integrated as-is - instead I'm trying to get it in piece by piece - but you can still try it and see if it works for you, or send comments. You can also help by writing tests for it.
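For what it's worth, a rough sketch of that "remove the old id, add a new one" step is shown below. The ZooKeeperAdmin class and the reconfigure signature used here are assumptions for illustration (they resemble what later 3.5 releases expose, not what exists today), and the server ids, host names and ports are made up.

    // Hypothetical sketch: replace a dead ensemble member by removing its old id
    // and adding the replacement VM under a fresh id, so the old and new server
    // can never be up at the same time with the same id. Ids/hosts/ports invented.
    import java.util.Arrays;
    import java.util.Collections;
    import org.apache.zookeeper.admin.ZooKeeperAdmin;

    public class ReplaceServerSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeperAdmin admin = new ZooKeeperAdmin(
                    "zookeeperB.whatever-domain.priv:2181", 30000, event -> { });

            admin.reconfigure(
                    Arrays.asList("server.4=zookeeperA-new.whatever-domain.priv:2888:3888:participant;2181"),
                    Collections.singletonList("1"), // id of the server being removed
                    null,                           // incremental mode, no full membership list
                    -1,                             // don't pin to a specific config version
                    null);                          // returned Stat not needed here
            admin.close();
        }
    }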
Best Regards,
Alex

On Fri, Mar 16, 2012 at 2:56 AM, Christian Ziech wrote:
> Under normal circumstances the ability to detect failures correctly should be a given. The scenario I'm aware of is that one zookeeper system would be taken down for a reason and then possibly just rebooted, or even started from scratch elsewhere. In both cases, however, the new host would have the old DNS name but most likely a different IP. But of course that only applies to us and possibly not to all of the users.
>
> When thinking about the scenario you described I understood where the problem lies. However, wouldn't the same problem also be relevant to the way zookeeper is implemented right now? Let me try to explain why (possibly I'm wrong here since I may miss some points on how zookeeper servers work internally - corrections are very welcome):
> - Same scenario as you described - node A with host name a, B with host name b and C with host name c.
> - Also as in your scenario, C is due to some human error falsely detected as down. Hence C' is brought up and is assigned the same DNS name as C.
> - Now rolling restarts are performed to bring in C'.
> - A resolves c correctly to the new IP and connects to C', but B still resolves the host name c to the original address of C and hence does not connect (I think some DNS slowness is also required for your approach, in order for the host name c to be resolved to the original IP of C).
> - Now the rest of your scenario happens: update U is applied, C' gets slow, C recovers and A fails.
> Of course this approach also requires some DNS craziness, but if I did not make a mistake in my thoughts it should still be possible.
>
> PS: Wouldn't your scenario also invalidate the solution of the hbase guys using Amazon's elastic IPs to solve the same problem (see https://issues.apache.org/jira/browse/HBASE-2327)?
> PS2: If the approach I had in mind is not valid, do you guys already have a plan for when 3.5.0 would be released, or could you be supported in some way so that ZOOKEEPER-107 makes it sooner into a release?
>
> On 16.03.2012 04:43, ext Alexander Shraer wrote:
>
>> Actually it's still not clear to me how you would enforce the 2x+1. In Zookeeper we can guarantee liveness (progress) only when x+1 are connected and up; however safety (correctness) is always guaranteed, even if 2 out of 3 servers are temporarily down. Your design needs the 2x+1 for safety, which I think is problematic unless you can accurately detect failures (synchrony) and failures are permanent.
>>
>> Alex
>>
>>
>> On Mar 15, 2012, at 3:54 PM, Alexander Shraer wrote:
>>
>>> I think the concern is that the old VM can recover and try to reconnect. Theoretically you could even go back and forth between the new and old VM. For example, suppose that you have servers A, B and C in the cluster, and A is the leader. C is slow and "replaced" with C', then update U is acked by A and C', then A fails. In this situation you cannot have additional failures. But with the automatic replacement thing it can (theoretically) happen that C' becomes a little slow, C connects to B and is chosen as leader, and the committed update U is lost forever. This is perhaps unlikely but possible...
>>>
>>> Alex
>>>
>>> On Thu, Mar 15, 2012 at 1:35 PM, christian.ziech@nokia.com wrote:
>>>>
>>>> I agree with your points that any kind of VM has hard-to-predict runtime behaviour and that participants of the zookeeper ensemble running on a VM could fail to send keep-alives for an uncertain amount of time. But I don't yet understand how that would break the approach I was mentioning: just trying to re-resolve the InetAddress after an IO exception should in that case still lead to the same original IP address (and eventually to that node rejoining the ensemble).
>>>> Only if the host name (the one the old node was using) were re-assigned to another instance would this re-resolving step point to a new IP (and hence cause the old server to be replaced).
>>>>
>>>> Did I understand your objection correctly?
>>>>
>>>> ________________________________________
>>>> From: ext Ted Dunning [ted.dunning@gmail.com]
>>>> Sent: Thursday, 15 March 2012 19:50
>>>> To: user@zookeeper.apache.org
>>>> Cc: shralex@gmail.com
>>>> Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
>>>>
>>>> Alexander's comment still applies.
>>>>
>>>> VMs can function or go away completely, but they can also malfunction in more subtle ways such that they just go VEEEERRRRY slowly. You have to account for that failure mode. These failures can even be transient.
>>>>
>>>> This would probably break your approach.
>>>>
>>>> On 3/15/12, christian.ziech@nokia.com wrote:
>>>>>
>>>>> Oh sorry, there is a slight misunderstanding. With VM I did not mean the Java VM but the Linux VM that contains the zookeeper node. We get notified if that goes away and is repurposed.
>>>>>
>>>>> BR
>>>>> Christian
>>>>>
>>>>> Sent from my Nokia Lumia 800
>>>>> ________________________________
>>>>> From: ext Alexander Shraer
>>>>> Sent: 15.03.2012 16:33
>>>>> To: user@zookeeper.apache.org; Ziech Christian (Nokia-LC/Berlin)
>>>>> Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
>>>>>
>>>>> yes, by replacing x at a time from 2x+1 you have quorum intersection.
>>>>>
>>>>> I have one more question - zookeeper itself doesn't assume perfect failure detection, which your scheme requires. What if the VM didn't actually fail but is just slow and then tries to reconnect?
>>>>>
>>>>> On Thu, Mar 15, 2012 at 2:50 AM, Christian Ziech wrote:
>>>>>>
>>>>>> I don't think that we could be running into a split-brain problem in our use case.
>>>>>> Let me try to describe the scenario we are worried about (assuming an ensemble of 5 nodes A, B, C, D, E):
>>>>>> - The ensemble is up and running and in sync.
>>>>>> - Node A with the host name "zookeeperA.whatever-domain.priv" goes down because the VM has gone away.
>>>>>> - That removal of the VM is detected and a new VM is spawned with the same host name "zookeeperA.whatever-domain.priv" - let's call that node A'.
>>>>>> - The zookeeper on node A' wants to join the cluster - right now this gets rejected by the others since A' has a different IP address than A (and the old one is "cached" in the InetSocketAddress of the QuorumPeer instance).
>>>>>>
>>>>>> We could ensure that at any given time there is at most one node with host name "zookeeperA.whatever-domain.priv" known by the ensemble and that once a node is replaced, it would not come back.
>>>>>> Also we could make sure that our ensemble is big enough to compensate for a replacement of no more than x nodes at a time (setting it to 2x + 1 nodes).
>>>>>>
>>>>>> So if I did not misestimate our problem, it should be (due to those restrictions) simpler than the problem to be solved by ZOOKEEPER-107. My intention in solving this smaller, discrete problem is basically to not need to wait for ZOOKEEPER-107 to make it into a release (the assumption is that a smaller fix possibly even has a chance to make it into the 3.4.x branch).
>>>>>>
>>>>>> On 15.03.2012 07:46, ext Alexander Shraer wrote:
>>>>>>>
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> ZK-107 would indeed allow you to add/remove servers and change their addresses.
>>>>>>>
>>>>>>>> We could ensure that we always have a more or less fixed quorum of zookeeper servers with a fixed set of host names.
>>>>>>>
>>>>>>> You should probably also ensure that a majority of the old ensemble intersects with a majority of the new one. Otherwise you have to run a reconfiguration protocol similar to ZK-107. For example, if you have 3 servers A, B and C, and now you're adding D and E to replace B and C, how would this work? It is probable that D and E don't have the latest state (as you mention) and that A is down or doesn't have the latest state either (a minority might not have the latest state). Also, how do you prevent split brain in this case, i.e. B and C thinking that they are still operational? Perhaps I'm missing something, but I suspect that the change you propose won't be enough...
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Alex
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 14, 2012 at 10:01 AM, Christian Ziech wrote:
>>>>>>>
>>>>>>> Just a small addition: In my opinion the patch could really boil down to adding a
>>>>>>>
>>>>>>>     quorumServer.electionAddr = new InetSocketAddress(electionAddr.getHostName(), electionAddr.getPort());
>>>>>>>
>>>>>>> in the catch (IOException e) clause of the connectOne() method of the QuorumCnxManager. In addition one should perhaps make the electionAddr field in the QuorumPeer.QuorumServer class volatile to prevent races.
>>>>>>>
>>>>>>> I haven't yet fully checked this change for implications, but doing a quick test on some machines at least showed it would solve our use case. What do the more expert users / maintainers think - is it even worthwhile to go that route?
>>>>>>>
>>>>>>> On 14.03.2012 17:04, ext Christian Ziech wrote:
>>>>>>>
>>>>>>> Let me describe our upcoming use case in a few words: We are planning to use zookeeper in a cloud where typically nodes come and go unpredictably. We could ensure that we always have a more or less fixed quorum of zookeeper servers with a fixed set of host names. However, the IPs associated with the host names would change every time a new server comes up.
>>>>>>> I browsed the code a little and it seems right now that the only problem is that the zookeeper server is remembering the resolved InetSocketAddress in its QuorumPeer hash map.
>>>>>>>
>>>>>>> I saw that ZOOKEEPER-107 would possibly also solve that problem, but perhaps in a more generic way than actually needed (maybe I underestimate here the impact of joining as a server with an empty data directory to replace a server that previously had one).
>>>>>>>
>>>>>>> Given that - from looking at ZOOKEEPER-107 - it seems it will still take some time for the proposed fix to make it into a release, would it make sense to invest time into a smaller fix just for this "replacing a dropped server without rolling restarts" use case? Would there be a chance that a fix for this makes it into the 3.4.x branch?
>>>>>>>
>>>>>>> Are there perhaps other ways to get this use case supported without the need for doing rolling restarts whenever we need to replace one of the zookeeper servers?
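For concreteness, a minimal sketch of the re-resolution idea described in this thread follows, using a simplified stand-in for QuorumCnxManager.connectOne(). The class, field and method names are illustrative only and do not mirror the actual ZooKeeper source; the point is just that rebuilding the InetSocketAddress from the host name inside the IOException handler forces a fresh DNS lookup on the next connection attempt.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Simplified stand-ins for QuorumPeer.QuorumServer and QuorumCnxManager.connectOne();
    // names and structure are invented for illustration, not taken from the real code.
    class QuorumServerStub {
        // volatile, as suggested above, so concurrent readers see the re-resolved address
        volatile InetSocketAddress electionAddr =
                new InetSocketAddress("zookeeperA.whatever-domain.priv", 3888);
    }

    class ConnectOneSketch {
        void connectOne(QuorumServerStub quorumServer) {
            try {
                Socket sock = new Socket();
                sock.connect(quorumServer.electionAddr, 5000);
                // ... normally the socket would be handed to the election sender/receiver workers ...
                sock.close();
            } catch (IOException e) {
                // The cached InetSocketAddress may still hold the IP of a VM that no
                // longer exists. Rebuilding it from the host name forces a fresh DNS
                // lookup, so the next attempt can reach a replacement VM behind the
                // same name; if the peer was merely slow, it resolves to the same IP.
                InetSocketAddress old = quorumServer.electionAddr;
                quorumServer.electionAddr =
                        new InetSocketAddress(old.getHostName(), old.getPort());
            }
        }
    }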