zookeeper-user mailing list archives

From Alexander Shraer <shra...@gmail.com>
Subject Re: Zookeeper on short lived VMs and ZOOKEEPER-107
Date Thu, 15 Mar 2012 22:54:23 GMT
I think the concern is that the old VM can recover and try to
reconnect. Theoretically you could even go back and forth between new
and old VM. For example, suppose that you have servers
A, B and C in the cluster, A is the leader. C is slow and "replaced"
with C', then update U is acked by A and C', then A fails. In this
situation you cannot have additional failures. But with the
automatic replacement thing it can (theoretically) happen that C'
becomes a little slow, C connects to B and is chosen as leader, and
the committed update U is lost forever. This is perhaps unlikely but
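
The scenario above can be replayed as a small counting exercise. What follows is only a sketch, not ZooKeeper code: the class and method names are made up for illustration, and the model simply assumes a 3-server ensemble in which any 2 reachable servers can form a quorum and elect a leader among themselves.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the A, B, C / C' example; not part of ZooKeeper.
final class LostUpdateSketch {

    /**
     * Returns true if a majority of the ensemble can be formed purely from
     * reachable servers that never acknowledged the update, i.e. a leader
     * could be elected without anyone in the quorum knowing about it.
     */
    static boolean updateCanBeLost(int ensembleSize,
                                   Set<String> reachable,
                                   Set<String> ackedUpdate) {
        int majority = ensembleSize / 2 + 1;
        Set<String> withoutUpdate = new HashSet<>(reachable);
        withoutUpdate.removeAll(ackedUpdate);
        return withoutUpdate.size() >= majority;
    }

    public static void main(String[] args) {
        // No replacement: U is acked by A and C, then A fails.
        // Only B lacks U, so B and C cannot elect a leader that misses U.
        Set<String> reachable = new HashSet<>(Arrays.asList("B", "C"));
        Set<String> ackedByOriginals = new HashSet<>(Arrays.asList("A", "C"));
        System.out.println(updateCanBeLost(3, reachable, ackedByOriginals)); // false

        // Automatic replacement: U is acked by A and C', then A fails,
        // C' turns slow and the old C comes back and connects to B.
        Set<String> ackedWithReplacement = new HashSet<>(Arrays.asList("A", "C'"));
        System.out.println(updateCanBeLost(3, reachable, ackedWithReplacement)); // true
    }
}
```

The second case is the problematic one: B and the old C together count as a majority of the configured ensemble, yet neither has seen U.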


On Thu, Mar 15, 2012 at 1:35 PM,  <christian.ziech@nokia.com> wrote:
> I agree with your points about any kind of VMs having a hard-to-predict runtime behaviour
> and that participants of the zookeeper ensemble running on a VM could fail to send keep-alives
> for an uncertain amount of time. But I don't yet understand how that would break the approach
> I was mentioning: Just trying to re-resolve the InetAddress after an IO exception should in
> that case still lead to the same original IP address (and eventually to that node rejoining
> the ensemble).
> Only if the host name the old node was using were re-assigned to another instance would
> this step of re-resolving point to a new IP (and hence cause the old server to be replaced).
> Did I understand your objection correctly?
> ________________________________________
> From: ext Ted Dunning [ted.dunning@gmail.com]
> Sent: Thursday, 15 March 2012 19:50
> To: user@zookeeper.apache.org
> Cc: shralex@gmail.com
> Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
> Alexander's comment still applies.
> VMs can function or go away completely, but they can also malfunction
> in more subtle ways such that they just go VEEEERRRRY slowly.  You
> have to account for that failure mode.  These failures can even be
> transient.
> This would probably break your approach.
> On 3/15/12, christian.ziech@nokia.com <christian.ziech@nokia.com> wrote:
>> Oh sorry there is a slight misunderstanding. With VM I did not mean the java
>> vm but the Linux vm that contains the zookeeper node. We get notified if
>> that goes away and is repurposed.
>> BR
>>   Christian
>> Sent from my Nokia Lumia 800
>> ________________________________
>> From: ext Alexander Shraer
>> Sent: 15.03.2012 16:33
>> To: user@zookeeper.apache.org; Ziech Christian (Nokia-LC/Berlin)
>> Subject: Re: Zookeeper on short lived VMs and ZOOKEEPER-107
>> yes, by replacing x at a time from 2x+1 you have quorum intersection.
>> i have one more question - zookeeper itself doesn't assume perfect
>> failure detection, which your scheme requires. what if the VM didn't
>> actually fail but is just slow and then tries to reconnect?
>> On Thu, Mar 15, 2012 at 2:50 AM, Christian Ziech
>> <christian.ziech@nokia.com> wrote:
>>> I don't think that we could be running into a split brain problem in our
>>> use case.
>>> Let me try to describe the scenario we are worried about (assuming an
>>> ensemble of 5 nodes A,B,C,D,E):
>>> - The ensemble is up and running and in sync
>>> - Node A with the host name "zookeeperA.whatever-domain.priv" goes down
>>> because the VM has gone away
>>> - That removal of the VM is detected and a new VM is spawned with the same
>>> host name "zookeeperA.whatever-domain.priv" - let's call that node A'
>>> - Node A' zookeeper wants to join the cluster - right now this gets
>>> rejected by the others since A' has a different IP address than A (and the
>>> old one is "cached" in the InetSocketAddress of the QuorumPeer instance)
>>> We could ensure that at any given time there is at most one node with
>>> host name "zookeeperA.whatever-domain.priv" known by the ensemble and that
>>> once a node is replaced, it would not come back. Also we could make sure
>>> that our ensemble is big enough to compensate for a replacement of up to
>>> x nodes at a time (setting it to x*2 + 1 nodes).
>>> So if I did not misjudge our problem it should be (due to the
>>> restrictions) simpler than the problem to be solved by zookeeper-107. My
>>> intention is basically, by solving this smaller discrete problem, to not
>>> need to wait until zookeeper-107 makes it into a release (the assumption is
>>> that a smaller fix possibly even has a chance to make it into the 3.4.x
>>> branch).
>>> On 15.03.2012 07:46, ext Alexander Shraer wrote:
>>>> Hi Christian,
>>>> ZK-107 would indeed allow you to add/remove servers and change their
>>>> addresses.
>>>> > We could ensure that we always have a more or less fixed quorum of
>>>> > zookeeper servers with a fixed set of host names.
>>>> You should probably also ensure that a majority of the old ensemble
>>>> intersects with a majority of the new one.
>>>> Otherwise you have to run a reconfiguration protocol similarly to ZK-107.
>>>> For example, if you have 3 servers A B and C, and now you're adding D and
>>>> E that replace B and C, how would this work? it is probable that D and E
>>>> don't have the latest state (as you mention) and A is down or doesn't
>>>> have the latest state too (a minority might not have the latest state).
>>>> Also, how do you prevent split brain in this case? meaning B and C
>>>> thinking that they are still operational? perhaps I'm missing something
>>>> but I suspect that the change you propose won't be enough...
>>>> Best Regards,
>>>> Alex
>>>> On Wed, Mar 14, 2012 at 10:01 AM, Christian Ziech
>>>> <christian.ziech@nokia.com> wrote:
>>>>    Just a small addition: In my opinion the patch could really boil
>>>>    down to adding a
>>>>      quorumServer.electionAddr = new InetSocketAddress(
>>>>          electionAddr.getHostName(), electionAddr.getPort());
>>>>    in the catch(IOException e) clause of the connectOne() method of
>>>>    the QuorumCnxManager. In addition one should perhaps make the
>>>>    electionAddr field in the QuorumPeer.QuorumServer class volatile
>>>>    to prevent races.
>>>>    I haven't checked this change yet fully for implications but doing
>>>>    a quick test on some machines at least showed it would solve our
>>>>    use case. What do the more expert users / maintainers think - is
>>>>    it even worthwhile to go that route?
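
The proposed change can be tried outside ZooKeeper. The sketch below is an illustration under stated assumptions, not the actual QuorumCnxManager code: `Server` is a hypothetical stand-in for QuorumPeer.QuorumServer, `reResolve` is a made-up helper, and "localhost" with port 3888 are placeholder values. It relies only on the documented behaviour that the InetSocketAddress(String, int) constructor attempts to resolve the host name at the time it is called.

```java
import java.net.InetSocketAddress;

// Hypothetical stand-alone illustration; not ZooKeeper source code.
final class ReResolveSketch {

    /** Stand-in for QuorumPeer.QuorumServer, reduced to the field under discussion. */
    static final class Server {
        // volatile so that a connection thread sees an address swapped in by another thread
        volatile InetSocketAddress electionAddr;

        Server(InetSocketAddress electionAddr) {
            this.electionAddr = electionAddr;
        }
    }

    /**
     * Builds a new address from the old host name and port. The constructor
     * performs a fresh lookup; if the lookup fails the result is marked
     * unresolved rather than throwing.
     */
    static InetSocketAddress reResolve(InetSocketAddress old) {
        return new InetSocketAddress(old.getHostName(), old.getPort());
    }

    public static void main(String[] args) {
        Server server = new Server(new InetSocketAddress("localhost", 3888));
        // In the proposal this assignment would sit in the catch (IOException e) branch.
        server.electionAddr = reResolve(server.electionAddr);
        System.out.println(server.electionAddr.getHostName() + ":"
                + server.electionAddr.getPort()
                + " unresolved=" + server.electionAddr.isUnresolved());
    }
}
```

One caveat when testing this: the JVM keeps its own cache of successful lookups (controlled by the networkaddress.cache.ttl security property), so a re-resolve may keep returning the old IP until that cache entry expires.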
>>>>    On 14.03.2012 17:04, ext Christian Ziech wrote:
>>>>        Let me describe our upcoming use case in a few words: We are
>>>>        planning to use zookeeper in a cloud where typically nodes come
>>>>        and go unpredictably. We could ensure that we always have a
>>>>        more or less fixed quorum of zookeeper servers with a fixed
>>>>        set of host names. However the IPs associated with the host
>>>>        names would change every time a new server comes up. I browsed
>>>>        the code a little and it seems right now that the only problem
>>>>        is that the zookeeper server is remembering the resolved
>>>>        InetSocketAddress in its QuorumPeer hash map.
>>>>        I saw that possibly ZOOKEEPER-107 would also solve that
>>>>        problem but possibly in a more generic way than actually
>>>>        needed (perhaps here I underestimate the impact of joining as
>>>>        a server with an empty data directory to replace a server that
>>>>        previously had one).
>>>>        Given that - from looking at ZOOKEEPER-107 - it seems that it
>>>>        will still take some time for the proposed fix to make it into
>>>>        a release, would it make sense to invest time into a smaller
>>>>        fix just for this "replacing a dropped server without
>>>>        rolling restarts" use case? Would there be a chance that a fix
>>>>        for this makes it into the 3.4.x branch?
>>>>        Are there perhaps other ways to get this use case supported
>>>>        without the need for doing rolling restarts whenever we need
>>>>        to replace one of the zookeeper servers?
