zookeeper-user mailing list archives

From Hulunbier <hulunb...@gmail.com>
Subject Re: Getting confused with the "recipe for lock"
Date Tue, 15 Jan 2013 06:32:10 GMT
Benjamin,

Thanks a lot for your response, and for the great product you guys
designed and implemented.

> these assumptions are manifest in ... the session timeouts of the clients.

Does this mean that the session_expired event may be triggered entirely
by the zk client library itself (by something like a built-in
client-local timer, without any notification from the zk server)?

(I am digging into the source code, but in case I am misreading it,
I would appreciate your confirmation.)

> HBase region servers went into gc for many minutes and then woke up
> still thinking they are the leader

Could this happen if I just follow the recipe for the distributed lock
(correctly, and without a client-local timer or external fencing
resources)?
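
To make clear what I mean by a "client-local timer" and an "external
fencing resource", here is a minimal sketch. It does not use the
ZooKeeper client API at all; the names LocalLeaseGuard, FencedResource
and safety_factor, and the plain integer token, are hypothetical and
only for illustration. I am assuming the token could be something
monotonically increasing, such as the sequence number of the lock node.

```python
import time


class LocalLeaseGuard:
    """Client-local timer: trust lock ownership only for a bounded time."""

    def __init__(self, session_timeout_s, safety_factor=0.5, clock=time.monotonic):
        # Act as owner for only a fraction of the session timeout; the rest
        # is margin for clock drift and message delay (an assumption).
        self._budget_s = session_timeout_s * safety_factor
        self._clock = clock
        self._last_acked_send = clock()

    def heartbeat_acked(self, sent_at):
        # Count from when the ping was SENT, not when the ack arrived: the
        # server heard from us no earlier than that, so this is conservative.
        self._last_acked_send = max(self._last_acked_send, sent_at)

    def may_act_as_owner(self):
        # True only while the last acknowledged ping is recent enough.
        return (self._clock() - self._last_acked_send) < self._budget_s


class FencedResource:
    """External resource that rejects writes carrying a stale token."""

    def __init__(self):
        self._highest_token = -1
        self.accepted = []

    def write(self, token, value):
        if token < self._highest_token:
            return False  # an owner with a newer token has already written
        self._highest_token = token
        self.accepted.append((token, value))
        return True
```

Even with such a timer, a gc pause between calling may_act_as_owner()
and actually touching the resource would still leave a window, which
is why I mention the fencing check as a separate measure.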





On Tue, Jan 15, 2013 at 1:27 PM, Benjamin Reed <breed@apache.org> wrote:
> sorry to jump in the middle, but i thought i'd point out a couple of things.
>
> at the heart of ZK is Zab, which is an atomic broadcast protocol (it
> actually has stronger guarantees than just atomic broadcast: it also
> guarantees primary order). updates go through this protocol which
> gives us sequential consistency for writes.
>
> failure detection uses timeouts, as most failure detectors do, so we
> have some assumptions on bounds of message delays and drifts of
> clocks. in the end, these assumptions are manifest in the sync and
> initial timeouts of the server and the session timeouts of the
> clients.
>
> as long as the assumptions are true, things will stay consistent, if
> the assumptions fail, such as when HBase region servers went into gc
> for many minutes and then woke up still thinking they are the leader,
> bad things can happen. the fix may be to use more conservative
> assumptions or to use a fencing scheme with external resources.
>
> if the assumptions are violated by the zookeeper cluster, it will
> manifest as a liveness problem rather than a safety issue. (in theory
> at least, we do have bugs occasionally :)
>
> ben
>
> On Mon, Jan 14, 2013 at 7:45 PM, Hulunbier <hulunbier@gmail.com> wrote:
>> Hi Jordan,
>>
>>> Why would client 1s connection be unstable but client 2s not? In any
>>> normal usage the ZK clients are going to be on the same network. Or, are
>>> you thinking cross-data-center usage? In my opinion, ZooKeeper is not
>>> suited to cross data center usage.
>>
>> Er... the word "unstable" I used was misleading. A fully functional (or
>> stable?) TCP connection can be expected to run into some network
>> congestion, and should / can handle it well, though possibly with some
>> delay in delivering segments. A high volume of LAN traffic may lead to
>> this situation, and it is not rare, I think.
>>
>> Even without such congestion, there is always a time lag between zk
>> sending the session-timeout message and the client receiving it.
>> Without any further assumption, we cannot ensure that the client
>> becomes aware that it no longer holds the lock before another client
>> gets the node_not_exist notification, successfully executes
>> getChildren, and concludes that it now holds the lock.
>>
>> I think in practice we could (or have to) accept this assumption:
>> "the server’s clock advances no faster than a known constant factor
>> faster than the client’s".
>>
>> But the assumption by itself is not enough for the correctness of the
>> lock protocol: the client can only wait passively for the
>> session_time_out message, so it may need a timer to explicitly check
>> the time elapsed.
>>
>> But the recipe claims clearly that:  "at any snapshot in time no two
>> clients think they hold the same lock", and "There is no polling or
>> timeouts."
>>
>>
>>> In any event, as others have pointed out, Zookeeper is _not_ a
>>> transactional system.
>>
>>> It is an eventually consistent system that will give you a reasonable
>>> degree of distributed coordination semantics.
>>
>> I should admit that I do not know whether ZK is eventually consistent,
>> transactional, or neither. (BTW, there is a recipe for 2pc, and some
>> guys claim that *Zab* is sequentially consistent.)
>>
>> Do these properties of ZK imply that there are assumptions about clock drift?
>>
>>>There are edge cases as you describe but they are in the level of noise.
>>
>> You might be right, but for me, edge cases are what I am worried about
>> (please do not get me wrong; I mean that different applications have
>> different requirements / constraints).
>>
>>>
>>> -Jordan
>>>
>>> On Jan 14, 2013, at 5:52 PM, Hulunbier <hulunbier@gmail.com> wrote:
>>>
>>>> Hi Vitalii,
>>>>
>>>> Thanks a lot, got your idea.
>>>>
>>>> Suppose we measure the time of events from outside the system (zk &
>>>> clients).
>>>>
>>>> And we have no client-side time-tracking routine.
>>>>
>>>> And t_i < t_k if i < k.
>>>>
>>>> t_0 :
>>>>
>>>> client1 has created lock/node1, client2 has created lock/node2;
>>>> client1 thinks it holds the lock; client2 does not, and is watching
>>>> lock/node1.
>>>>
>>>> t_1 :
>>>>
>>>> ZK considers client1's session timed out (let's say client1 actually
>>>> failed to send its heart-beat message on time, due to a long jvm gc
>>>> pause).
>>>>
>>>> ZK deletes lock/node1,
>>>> sends the timeout message to client1,
>>>> sends the "node_not_exist" message to client2 (or sends this message
>>>> before the deletion, but that does not matter in our case)
>>>>
>>>> but for some reason, the link between zk and client1 becomes very
>>>> unstable: high packet loss and a large amount of packet retransmission,
>>>> which leads to a significant packet transmission delay (between client1
>>>> and zk only), but the tcp connection is NOT broken.
>>>>
>>>> t_2:
>>>>
>>>> client2 gets the "node_not_exist" event, and issues the getChildren command.
>>>>
>>>> t_3:
>>>>
>>>> client2 finds only the node lock/node2, concludes that it holds the
>>>> lock, and begins acting as the lock owner.
>>>>
>>>> (at the same time, client1 also thinks it holds the lock)
>>>>
>>>> t_4:
>>>>
>>>> the session_timeout message has not reached client1 yet;
>>>>
>>>> client1's jvm gc completes, and client1 does something as the lock owner.
>>>>
>>>> t_5:
>>>>
>>>> the network becomes stable, and the session_timeout message sent by
>>>> zk finally reaches client1;
>>>>
>>>> client1 now thinks it no longer holds the lock, but it is too late:
>>>> it has done something really bad between t_4 and t_5.
>>>>
>>>> --------------------------
>>>>
>>>> Sorry for the grammar, I am not a native English speaker.
>>>>
>>>>
>>>> On Mon, Jan 14, 2013 at 11:38 PM, Vitalii Tymchyshyn <tivv00@gmail.com> wrote:
>>>>> There are two events: disconnected and session expired. The ephemeral
>>>>> nodes are removed after the second one. The client receives both. So to
>>>>> implement an "at most one lock holder" scheme, the client owning the lock
>>>>> must consider lock ownership lost as soon as it receives the disconnected
>>>>> event. So, there is a period of time between disconnect and session
>>>>> expired when no one should have the lock. It's "safety" time to
>>>>> accommodate time shifts, network latencies, and the lock ownership
>>>>> recheck interval (in case the client can't stop using the resource
>>>>> immediately and simply checks regularly whether it still holds the lock).
>>>>>
>>>>>
>>>>>
>>>>> 2013/1/14 Hulunbier <hulunbier@gmail.com>
>>>>>
>>>>>> Hi Vitalii,
>>>>>>
>>>>>>> I don't see why clock must be in sync.
>>>>>>
>>>>>> I don't see any reason to precisely sync the clocks either (but if
>>>>>> we could ... that would be wonderful.).
>>>>>>
>>>>>> By *some constraints on clock drift*, I mean:
>>>>>>
>>>>>> "Every node has a clock, and all clocks increase at the same rate"
>>>>>> or
>>>>>> "the server’s clock advances no faster than a known constant factor
>>>>>> faster than the client’s.".
>>>>>>
>>>>>>
>>>>>>> Also note the difference between disconnected and session
>>>>>>> expired events. This time difference is when client knows "something's
>>>>>>> wrong", but another client did not get a lock yet.
>>>>>>
>>>>>> Sorry, but I did not quite get your idea; would you please give me
>>>>>> some further explanation?
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 14, 2013 at 6:37 PM, Vitalii Tymchyshyn <tivv00@gmail.com>
>>>>>> wrote:
>>>>>>> I don't see why clock must be in sync. They are counting time
>>>>>>> periods (timeouts). Also note the difference between disconnected and
>>>>>>> session expired events. This time difference is when client knows
>>>>>>> "something's wrong", but another client did not get a lock yet. You
>>>>>>> will have problems if client can't react (and release resources)
>>>>>>> between these two events.
>>>>>>>
>>>>>>> Best regards, Vitalii Tymchyshyn
>>>>>>>
>>>>>>>
>>>>>>> 2013/1/13 Hulunbier <hulunbier@gmail.com>
>>>>>>>
>>>>>>>> Thanks Jordan,
>>>>>>>>
>>>>>>>>> Assuming the clocks are in sync between all participants…
>>>>>>>>
>>>>>>>> imho, perfect clock synchronization in a distributed system is
>>>>>>>> very hard (if it is possible at all).
>>>>>>>>
>>>>>>>>> Someone with better understanding of ZK internals can correct
>>>>>>>>> me, but this is my understanding.
>>>>>>>>
>>>>>>>> I think I might have missed some very important and subtle (or
>>>>>>>> obvious?) points of the recipe / ZK protocol.
>>>>>>>>
>>>>>>>> I just cannot believe that there could be such a flaw in the
>>>>>>>> lock recipe, for so long, without anybody having pointed it out.
>>>>>>>>
>>>>>>>> On Sun, Jan 13, 2013 at 9:31 AM, Jordan Zimmerman
>>>>>>>> <jordan@jordanzimmerman.com> wrote:
>>>>>>>>> On Jan 12, 2013, at 2:30 AM, Hulunbier <hulunbier@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Suppose the network link between client1 and the server is of
>>>>>>>>>> very low quality (high packet loss rate?) but still fully
>>>>>>>>>> functional.
>>>>>>>>>>
>>>>>>>>>> Client1 may be happily sending heart-beat messages to the server
>>>>>>>>>> without noticing anything; but the ZK server could be unable to
>>>>>>>>>> receive heart-beat messages from client1 for a long period of
>>>>>>>>>> time, which leads the ZK server to time out client1's session
>>>>>>>>>> and delete the ephemeral node
>>>>>>>>>
>>>>>>>>> I believe the heartbeats go both ways. Thus, if the client doesn't
>>>>>>>>> hear from the server it will post a Disconnected event.
>>>>>>>>>
>>>>>>>>>> But I still feel that, no matter how well a ZK application
>>>>>>>>>> behaves, if we use an ephemeral node in the lock recipe, we
>>>>>>>>>> cannot guarantee "at any snapshot in time no two clients think
>>>>>>>>>> they hold the same lock", which is the fundamental
>>>>>>>>>> requirement/constraint for a lock.
>>>>>>>>>
>>>>>>>>> Assuming the clocks are in sync between all participants… The
>>>>>>>>> server and the client that holds the lock should determine that
>>>>>>>>> there is a disconnection at nearly the same time. I imagine that
>>>>>>>>> there is a certain amount of time (a few milliseconds) overlap
>>>>>>>>> here. But, the next client wouldn't get the notification
>>>>>>>>> immediately anyway. Further, when the next client gets the
>>>>>>>>> notification, it still needs to execute a getChildren() command,
>>>>>>>>> process the results, etc. before it can determine that it has the
>>>>>>>>> lock. That two clients would think they have the lock at the same
>>>>>>>>> time is a vanishingly small possibility. Even if it did happen it
>>>>>>>>> would only be for a few milliseconds at most.
>>>>>>>>>
>>>>>>>>> Someone with better understanding of ZK internals can correct me,
>>>>>>>>> but this is my understanding.
>>>>>>>>>
>>>>>>>>> -Jordan
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Vitalii Tymchyshyn
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Vitalii Tymchyshyn
>>>
