zookeeper-user mailing list archives

From Chang Song <tru64...@me.com>
Subject Re: Serious problem processing heartbeat on login stampede
Date Thu, 14 Apr 2011 22:02:15 GMT

On Apr 15, 2011, at 1:04 AM, Patrick Hunt wrote:

> 2011/4/14 Chang Song <tru64ufs@me.com>:
>>> 2) regarding IO: if you run 'iostat -x 2' on the zk servers while your
>>> issue is happening, what's the %util of the disk? What does the iowait
>>> look like?
>> Again, no I/O at all.   0%
> This is simply not possible.
> Sessions are persistent. Each time a session is created, and each time
> it is closed, a transaction is written by the zk server to the data
> directory. Additionally, log4j-based logs are also being streamed to
> the disk. Each of these activities will cause disk IO that will show
> up in iostat.

Pat, I didn't say there wasn't any IO, just that utilization was 0%,
meaning no significant IO. It is possible that our monitoring agent
misses some updates, since it samples at 5-minute intervals.
I will try to log in to the server and watch.
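
For example, with the command you suggested (the columns to watch on the
device holding the ZK data directory are %util, await, and avgqu-sz):

    iostat -x 2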

But since this is session related: are you using fsync() to flush the
log buffer out to disk? If so, I should immediately see I/O activity
go through the roof.
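
If the server does force the transaction log to disk before acking each
createSession/closeSession (as I understand the stock server's sync path
does), the write pattern would be roughly the following. A minimal sketch
of that pattern, not ZooKeeper's actual code:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Sketch of an append-then-fsync transaction log: the pattern that
    // makes every session create/close hit the disk.
    public class TxnLogSketch {
        private final FileChannel channel;

        public TxnLogSketch(File logFile) throws Exception {
            this.channel = new FileOutputStream(logFile, true).getChannel();
        }

        // Append a serialized transaction and force it to disk before
        // the request is acknowledged to the client.
        public void append(byte[] serializedTxn) throws Exception {
            channel.write(ByteBuffer.wrap(serializedTxn));
            channel.force(false); // the fsync; this is where disk I/O shows up
        }
    }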

>> Patrick, they are not continuously logging in and out; maybe a couple
>> of times a week, and before they push a new feature. When this happens,
>> clients in group A drop out of the cluster, which causes problems for
>> other, unrelated services.
> Ok, good to know.
>> It is not about the use case: the ZK clients simply tried to connect to
>> the ZK ensemble. No particular use case applies; many clients just log
>> in at the same time, expire at the same time, or close their sessions
>> at the same time.
> As I mentioned, I've seen cluster sizes of 10,000 clients (10x what
> you report) that didn't have this issue. While bugs might be lurking,
> I've also worked with many teams deploying clusters (probably close to
> 100 by now), some of which had problems; the suggestions I'm making to
> you are based on that experience.

Sure. I understand.

>> Heartbeats should be handled in an isolated queue and a
>> dedicated thread. I don't think we need strict ordering
>> of heartbeats, do we?
> ZK is purposely architected this way; it is not a mistake/bug. It is a
> fallacy for a highly available service to respond quickly to a
> heartbeat when it cannot service regular requests in a timely fashion.
> This is one of the main reasons why heartbeats are handled in this
> way.

If that's the case, we REALLY need to fix this problem the hard way.
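
What I have in mind is roughly the following. This is a sketch of the
isolated-queue idea with hypothetical names, not a patch against the
actual server:

    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of the proposal: pings bypass the commit pipeline and are
    // answered by a dedicated thread. All names here are hypothetical.
    public class PingResponder extends Thread {
        // stand-in for the server's connection object
        public interface Connection { void sendPingReply(); }

        private final LinkedBlockingQueue<Connection> pings =
            new LinkedBlockingQueue<Connection>();

        // Called from the request pipeline when a packet identifies
        // itself as a ping; never blocks the pipeline.
        public void submitPing(Connection c) {
            pings.offer(c);
        }

        @Override
        public void run() {
            while (!isInterrupted()) {
                try {
                    // respond immediately, regardless of how deep the
                    // createSession backlog in CommitProcessor is
                    pings.take().sendPingReply();
                } catch (InterruptedException e) {
                    return;
                }
            }
        }
    }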


> Patrick
>>> Patrick
>>>> It's about CommitProcessor thread queueing (in the leader).
>>>> QueuedRequests goes up to 800, and so does CommittedRequests;
>>>> PendingRequestElapsedTime goes up to 8.8 seconds during this flood.
>>>> The easiest way to reproduce this scenario exactly is to either
>>>> - suspend all client JVMs in group B with a debugger, or
>>>> - cause all client JVMs in group B to OOME and write heap dumps.
>>>> All clients in group A will then fail to receive a ping
>>>> response within 5 seconds.
>>>> We need to fix this as soon as possible.
>>>> What we do as a workaround is to raise sessionTimeout to 40 sec.
>>>> At least the clients in group A survive, but this increases
>>>> our cluster failover time significantly.
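
(For reference, the session timeout in question is the one passed to the
client constructor; the connect string below is a placeholder and
`watcher` stands for the application's Watcher instance:

    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Workaround: session timeout raised from 15000 to 40000 ms.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181",
                                 40000, watcher);

Note the server negotiates this value between 2*tickTime and 20*tickTime
by default, so the ensemble's tickTime has to allow a 40 sec timeout.)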
>>>> Thank you, Patrick.
>>>> ps. We actually push ping requests to the FinalRequestProcessor as
>>>>    soon as the packet identifies itself as a ping. No dice.
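
Roughly what that short-circuit looked like; a simplified sketch of the
idea (OpCode.ping and the processor chain do exist in the server, but
this is illustrative, not our actual patch):

    // In the server's request submission path: route pings straight to
    // the final processor instead of through the commit queue.
    public void submitRequest(Request request) {
        if (request.type == OpCode.ping) {
            finalProcessor.processRequest(request); // bypass CommitProcessor
        } else {
            firstProcessor.processRequest(request); // normal pipeline
        }
    }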
>>>> On Apr 14, 2011, at 12:21 AM, Patrick Hunt wrote:
>>>>> Hi Chang, it sounds like you may have an issue with your cluster
>>>>> environment/setup, or perhaps a resource (GC/mem) issue. Have you
>>>>> looked through the troubleshooting guide?
>>>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting
>>>>> In particular, 1000 clients connecting should be fine; I've personally
>>>>> seen clusters of 7-10 thousand clients. Keep in mind that each session
>>>>> establishment is essentially a write (so the quorum is involved) and
>>>>> what we typically see there is that the cluster configuration has
>>>>> issues. 14 seconds for a ping response is huge and indicates one of
>>>>> the following may be an underlying cause:
>>>>> 1) are you running in a virtualized environment?
>>>>> 2) are you co-locating other services on the same host(s) that make up
>>>>> the ZK serving cluster?
>>>>> 3) have you followed the admin guide's "things to avoid"?
>>>>> http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
>>>>> In particular, ensure that you are not swapping or going into GC
>>>>> pause (on both the server and the client):
>>>>> a) try turning on GC logging (example flags below) and ensure that you
>>>>> are not going into GC pause; see the troubleshooting guide. This is
>>>>> the most common cause of high latency for the clients.
>>>>> b) ensure that you are not swapping
>>>>> c) ensure that other processes are not causing log writing
>>>>> (transactional logging) to be slow.
>>>>> Patrick
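
For item (a) above, GC logging on a JVM of this era (Java 6) is enabled
with flags along these lines; the log path is a placeholder:

    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -Xloggc:/path/to/gc.log ...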
>>>>> On Wed, Apr 13, 2011 at 6:35 AM, Chang Song <tru64ufs@me.com> wrote:
>>>>>> Hello, folks.
>>>>>> We have run into a very serious issue with ZooKeeper.
>>>>>> Here's a brief scenario.
>>>>>> We have some ZooKeeper clients with a session timeout of 15 sec
>>>>>> (thus a 5 sec ping interval); let's call these clients group A.
>>>>>> Now 1000 new clients (let's call these group B) start up at the
>>>>>> same time trying to connect to a three-node ZK ensemble, creating
>>>>>> a ZK createSession stampede.
>>>>>> Now almost all clients in group A are unable to exchange pings
>>>>>> within the session expiry time (15 sec).
>>>>>> Thus the clients in group A drop out of the cluster.
>>>>>> We have looked into this issue a bit and found that session queue
>>>>>> processing is mostly synchronous.
>>>>>> Latency between ping request and response ranges from 10 ms up to
>>>>>> 14 seconds during this login stampede.
>>>>>> Since session timeout is a serious matter for our cluster, pings
>>>>>> should be handled in a pseudo-realtime fashion.
>>>>>> I don't know exactly how the ping timeout policy works in the
>>>>>> clients and the server, but clients failing to receive ping
>>>>>> responses because of a ZooKeeper login stampede seems like
>>>>>> nonsense to me.
>>>>>> Shouldn't we have a separate ping/heartbeat queue and thread?
>>>>>> Or even multiple ping queues/threads to keep heartbeats realtime?
>>>>>> This is a very serious issue with ZooKeeper for our
>>>>>> mission-critical system. Could anyone look into it?
>>>>>> I will try to file a bug.
>>>>>> Thank you.
>>>>>> Chang
