zookeeper-user mailing list archives

From Chang Song <tru64...@me.com>
Subject Re: Serious problem processing hearbeat on login stampede
Date Thu, 14 Apr 2011 22:10:05 GMT

Yes, Ben.

If you read my emails carefully, I already said it is not the heartbeat
itself; it is session establishment/closing that gets stampeded.
Since the responses to all requests are delayed, heartbeats are delayed
as well.
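
To make the queueing effect concrete, here is a minimal Python sketch (not ZooKeeper code). It assumes a single FIFO request queue and a uniform per-request service time. The function name and the 11 ms figure are illustrative assumptions; the latter is chosen so that the roughly 800 queued requests reported further down this thread add up to the reported 8.8 seconds.

```python
from collections import deque


def ping_wait_ms(queued_requests, service_time_ms):
    """Time a ping waits behind `queued_requests` entries in one FIFO queue.

    Hypothetical model: every queued request costs the same `service_time_ms`.
    """
    queue = deque(["session"] * queued_requests + ["ping"])
    elapsed_ms = 0
    while queue:
        request = queue.popleft()
        if request == "ping":
            # The ping is served only after everything queued ahead of it.
            return elapsed_ms
        elapsed_ms += service_time_ms
    return elapsed_ms


# 800 queued session requests at an assumed 11 ms each.
wait_ms = ping_wait_ms(800, 11)
print(f"ping waited {wait_ms} ms")
```

In this model the ping itself costs nothing; all of its latency comes from the session requests queued ahead of it.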


You need to understand that most apps can tolerate delays in connect/close,
but we cannot tolerate ping delays, since we use the ZK heartbeat timeout
as our sole means of failure detection.
We use a 15-second session timeout (5 sec for each ensemble), so important
servers drop out of the clusters even when they are not malfunctioning.
In some cases this wreaks havoc on certain services.
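
The trade-off can be put in numbers. This is a rough Python sketch, assuming the client pings at about one third of the session timeout (the 15 sec / 5 sec relationship used in this thread) and that a dead server is noticed only once a full session timeout has passed. Both are simplifications, and the function names are made up for illustration.

```python
def ping_interval_s(session_timeout_s):
    # Assumption taken from this thread: 15 s session timeout -> 5 s ping.
    return session_timeout_s / 3


def worst_case_detection_s(session_timeout_s):
    # Simplification: a silent server is declared dead only after a full timeout.
    return session_timeout_s


# 15 s is our current setting; 40 s is the workaround mentioned in this thread.
for timeout in (15, 40):
    print(timeout, ping_interval_s(timeout), worst_case_detection_s(timeout))
```

A longer timeout lets clients ride out a multi-second stall, but real failures take correspondingly longer to detect.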


1. 3.3.3 (latest)

2. We have a boot disk and a usr disk.
    But as I said, disk I/O is not what is causing the 8-second delay.

My team will file a JIRA today; we'll have to discuss it on JIRA ;)

Thank you.

Chang




On 2011-04-15, at 2:59 AM, Benjamin Reed wrote:

> chang,
> 
> if the problem is on client startup, then it isn't a heartbeat
> stampede, it is session establishment. the heartbeats are very
> lightweight, so i can't imagine them causing any issues.
> 
> the two key issues we need to know are: 1) the version of the server
> you are running, and 2) if you are using a dedicated device for the
> transaction log.
> 
> ben
> 
> 2011/4/14 Patrick Hunt <phunt@apache.org>:
>> 2011/4/14 Chang Song <tru64ufs@me.com>:
>>>> 2) regarding IO, if you run 'iostat -x 2' on the zk servers while your
>>>> issue is happening, what's the %util of the disk? what's the iowait
>>>> look like?
>>>> 
>>> 
>>> Again, no I/O at all.   0%
>>> 
>> 
>> This is simply not possible.
>> 
>> Sessions are persistent. Each time a session is created, and each time
>> it is closed, a transaction is written by the zk server to the data
>> directory. Additionally log4j based logs are also being streamed to
>> the disk. Each of these activities will cause disk IO that will show
>> up on iostat.
>> 
>>> Patrick, they do not log in and out continuously.
>>> Maybe a couple of times a week, and before they push a new feature.
>>> When this happens, clients in group A drop out of the clusters, which
>>> causes problems for other unrelated services.
>>> 
>> 
>> Ok, good to know.
>> 
>>> 
>>> It is not about the use case, because the ZK clients simply try to connect
>>> to the ZK ensemble. No use case applies. Many clients just log in at the
>>> same time, expire at the same time, or close sessions at the same time.
>>> 
>> 
>> As I mentioned, I've seen cluster sizes of 10,000 clients (10x what
>> you report) that didn't have this issue. While bugs might be lurking,
>> I've also worked with many teams deploying clusters (probably close to
>> 100 by now), some of which had problems; the suggestions I'm making to
>> you are based on that experience.
>> 
>>> Heartbeats should be handled in an isolated queue and a
>>> dedicated thread.  I don't think we need to keep strict ordering
>>> of heartbeats, do we?
>> 
>> ZK is purposely architected this way; it is not a mistake/bug. It is a
>> fallacy for a highly available service to respond quickly to a
>> heartbeat when it cannot service regular requests in a timely fashion.
>> This is one of the main reasons why heartbeats are handled in this
>> way.
>> 
>> Patrick
>> 
>>>> Patrick
>>>> 
>>>>> It's about CommitProcessor thread queueing (in leader).
>>>>> QueuedRequests goes up to 800, so does commitedRequests and
>>>>> PendingRequestElapsedTime. PendingRequestElapsedTime
>>>>> goes up to 8.8 seconds during this flood.
>>>>> 
>>>>> To reproduce this scenario exactly, the easiest way is to
>>>>> 
>>>>> - suspend all client JVMs with a debugger
>>>>> - cause all client JVMs to hit an OOME and create heap dumps
>>>>> 
>>>>> in group B. All clients in group A will then fail to receive a
>>>>> ping response within 5 seconds.
>>>>> 
>>>>> We need to fix this as soon as possible.
>>>>> What we do as a workaround is raise sessionTimeout to 40 sec.
>>>>> At least the clients in group A survive. But this increases
>>>>> our cluster failover time significantly.
>>>>> 
>>>>> Thank you, Patrick.
>>>>> 
>>>>> 
>>>>> P.S. We actually tried pushing ping requests to FinalRequestProcessor
>>>>>    as soon as the packet identifies itself as a ping. No dice.
>>>>> 
>>>>> 
>>>>> 
>>>>> On 2011-04-14, at 12:21 AM, Patrick Hunt wrote:
>>>>> 
>>>>>> Hi Chang, it sounds like you may have an issue with your cluster
>>>>>> environment/setup, or perhaps a resource (GC/mem) issue. Have you
>>>>>> looked through the troubleshooting guide?
>>>>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting
>>>>>> 
>>>>>> In particular 1000 clients connecting should be fine; I've personally
>>>>>> seen clusters of 7-10 thousand clients. Keep in mind that each session
>>>>>> establishment is essentially a write (so the quorum is involved), and
>>>>>> what we typically see there is that the cluster configuration has
>>>>>> issues. 14 seconds for a ping response is huge and indicates one of
>>>>>> the following may be an underlying cause:
>>>>>> 
>>>>>> 1) are you running in a virtualized environment?
>>>>>> 2) are you co-locating other services on the same host(s) that make up
>>>>>> the ZK serving cluster?
>>>>>> 3) have you followed the admin guide's "things to avoid"?
>>>>>> http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
>>>>>> In particular ensuring that you are not swapping or going into gc
>>>>>> pause (both on the server and the client)
>>>>>> a) try turning on GC logging and ensure that you are not going into GC
>>>>>> pause, see the troubleshooting guide, this is the most common cause of
>>>>>> high latency for the clients
>>>>>> b) ensure that you are not swapping
>>>>>> c) ensure that other processes are not causing log writing
>>>>>> (transactional logging) to be slow.
>>>>>> 
>>>>>> Patrick
>>>>>> 
>>>>>> On Wed, Apr 13, 2011 at 6:35 AM, Chang Song <tru64ufs@me.com> wrote:
>>>>>>> Hello, folks.
>>>>>>> 
>>>>>>> We have run into a very serious issue with ZooKeeper.
>>>>>>> Here's a brief scenario.
>>>>>>> 
>>>>>>> We have some ZooKeeper clients with a session timeout of 15 sec
>>>>>>> (thus 5 sec ping); let's call these clients group A.
>>>>>>> 
>>>>>>> Now 1000 new clients (let's call these group B) start up at the
>>>>>>> same time, trying to connect to a three-node ZK ensemble and
>>>>>>> creating a ZK createSession stampede.
>>>>>>> 
>>>>>>> Now almost all clients in group A are unable to exchange pings
>>>>>>> within the session expiry time (15 sec).
>>>>>>> Thus clients in group A drop out of the cluster.
>>>>>>> 
>>>>>>> We have looked into this issue a bit and found that session queue
>>>>>>> processing is mostly synchronous in nature.
>>>>>>> Latency between ping request and response ranges from 10 ms up
>>>>>>> to 14 seconds during this login stampede.
>>>>>>> 
>>>>>>> Since session timeout is a serious matter for our cluster,
>>>>>>> pings should be handled in a pseudo-realtime fashion.
>>>>>>> 
>>>>>>> I don't know exactly how the ping timeout policy works in the
>>>>>>> clients and the server, but clients failing to receive ping
>>>>>>> responses because of ZooKeeper session logins seems nonsensical
>>>>>>> to me.
>>>>>>> 
>>>>>>> Shouldn't we have a separate ping/heartbeat queue and thread?
>>>>>>> Or even multiple ping queues/threads to keep realtime heartbeat?
>>>>>>> 
>>>>>>> This is a very serious issue with ZooKeeper for our mission-critical
>>>>>>> system. Could anyone look into this?
>>>>>>> 
>>>>>>> I will try to file a bug.
>>>>>>> 
>>>>>>> Thank you.
>>>>>>> 
>>>>>>> Chang
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>> 

