zookeeper-user mailing list archives

From Chang Song <tru64...@me.com>
Subject Re: Serious problem processing heartbeat on login stampede
Date Thu, 14 Apr 2011 22:20:46 GMT
sure I will

thank you.

Chang


On Apr 15, 2011, at 7:16 AM, Benjamin Reed wrote:

> when you file the jira can you also note the logging level you are using?
> 
> thanx
> ben
> 
> 2011/4/14 Chang Song <tru64ufs@me.com>:
>> 
>> Yes, Ben.
>> 
>> If you read my earlier emails carefully, I already said it is not the heartbeats
>> themselves; it is session establishment/closing that gets stampeded.
>> Since the responses to all requests are delayed, heartbeats are delayed
>> as well.
>> 
>> 
>> You need to understand that most apps can tolerate delay in connect/close,
>> but we cannot tolerate ping delay, since we rely on the ZK heartbeat timeout
>> as our sole means of failure detection.
>> We use a 15-second session timeout (5 sec for each ensemble member), so an
>> important server will drop out of the cluster even though it is not
>> malfunctioning; in some cases this wreaks havoc on certain services.
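>> 
>> For illustration, a client that uses the session timeout this way might look
>> roughly like the following (the connect string and the expiry handling are
>> illustrative only, not our actual code):
>> 
>>     import org.apache.zookeeper.WatchedEvent;
>>     import org.apache.zookeeper.Watcher;
>>     import org.apache.zookeeper.ZooKeeper;
>> 
>>     public class FailureDetectorClient {
>>         public static void main(String[] args) throws Exception {
>>             // 15-second session timeout: if pings are not answered within
>>             // this window, the session expires and the node is treated as failed.
>>             ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000,
>>                 new Watcher() {
>>                     public void process(WatchedEvent event) {
>>                         if (event.getState() == Watcher.Event.KeeperState.Expired) {
>>                             // illustrative hook: mark this node as dead in the cluster
>>                             System.err.println("session expired -> dropping out of cluster");
>>                         }
>>                     }
>>                 });
>>             // ... register an ephemeral node for membership, etc.
>>         }
>>     }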
>> 
>> 
>> 1. 3.3.3 (latest)
>> 
>> 2. We have a boot disk and a usr disk.
>>   But as I said, disk I/O is not the issue causing the 8-second delay.
>> 
>> My team will file a JIRA today; we'll have to discuss it there ;)
>> 
>> Thank you.
>> 
>> Chang
>> 
>> 
>> 
>> 
>> 2011. 4. 15., 오전 2:59, Benjamin Reed 작성:
>> 
>>> chang,
>>> 
>>> if the problem is on client startup, then it isn't the heartbeats being
>>> stampeded, it is session establishment. the heartbeats are very
>>> lightweight, so i can't imagine them causing any issues.
>>> 
>>> the two key things we need to know are: 1) the version of the server
>>> you are running, and 2) whether you are using a dedicated device for the
>>> transaction log.
>>> 
>>> ben
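>>> 
>>> (for reference, a dedicated transaction-log device is usually set up by
>>> pointing dataLogDir at a separate disk in zoo.cfg; the paths below are
>>> purely illustrative)
>>> 
>>>     # zoo.cfg sketch: snapshots and the txn log on separate devices
>>>     tickTime=2000
>>>     # snapshots (and, by default, the transaction log)
>>>     dataDir=/var/zookeeper/data
>>>     # put the transaction log on its own dedicated device instead
>>>     dataLogDir=/mnt/zk-txnlog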
>>> 
>>> 2011/4/14 Patrick Hunt <phunt@apache.org>:
>>>> 2011/4/14 Chang Song <tru64ufs@me.com>:
>>>>>> 2) regarding IO, if you run 'iostat -x 2' on the zk servers while your
>>>>>> issue is happening, what's the %util of the disk? what's the iowait
>>>>>> look like?
>>>>>> 
>>>>> 
>>>>> Again, no I/O at all.   0%
>>>>> 
>>>> 
>>>> This is simply not possible.
>>>> 
>>>> Sessions are persistent. Each time a session is created, and each time
>>>> it is closed, a transaction is written by the zk server to the data
>>>> directory. Additionally, log4j-based logs are also being streamed to
>>>> disk. Each of these activities will cause disk IO that will show
>>>> up in iostat.
>>>> 
>>>>> Patrick, they are not continuously logging in and out.
>>>>> Maybe a couple of times a week, and before they push a new feature.
>>>>> When this happens, clients in group A drop out of the cluster, which causes
>>>>> problems for other, unrelated services.
>>>>> 
>>>> 
>>>> Ok, good to know.
>>>> 
>>>>> 
>>>>> It is not about the use case, because the ZK clients simply tried to
>>>>> connect to the ZK ensemble. No particular use case applies; just many clients
>>>>> logging in at the same time, expiring at the same time, or closing sessions at the same time.
>>>>> 
>>>> 
>>>> As I mentioned, I've seen cluster sizes of 10,000 clients (10x what
>>>> you report) that didn't have this issue. While bugs might be lurking,
>>>> I've also worked with many teams deploying clusters (probably close to
>>>> 100 by now), some of which had problems; the suggestions I'm making to
>>>> you are based on that experience.
>>>> 
>>>>> Heartbeats should be handled in an isolated queue and a
>>>>> dedicated thread.  I don't think we need strict ordering
>>>>> of heartbeats, do we?
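>>>>> 
>>>>> Just to illustrate the suggestion (this is not how ZooKeeper is implemented,
>>>>> and the PingPacket type here is invented for the sketch): a dedicated thread
>>>>> that answers heartbeats out of band, independent of the commit pipeline.
>>>>> 
>>>>>     import java.util.concurrent.BlockingQueue;
>>>>>     import java.util.concurrent.LinkedBlockingQueue;
>>>>> 
>>>>>     public class PingResponderSketch {
>>>>>         // invented for this sketch; stands in for a heartbeat packet
>>>>>         interface PingPacket { void reply(); }
>>>>> 
>>>>>         private final BlockingQueue<PingPacket> pingQueue =
>>>>>                 new LinkedBlockingQueue<PingPacket>();
>>>>> 
>>>>>         void start() {
>>>>>             Thread t = new Thread(new Runnable() {
>>>>>                 public void run() {
>>>>>                     try {
>>>>>                         while (true) {
>>>>>                             // heartbeats answered here, never queued behind
>>>>>                             // session create/close transactions
>>>>>                             pingQueue.take().reply();
>>>>>                         }
>>>>>                     } catch (InterruptedException e) {
>>>>>                         Thread.currentThread().interrupt();
>>>>>                     }
>>>>>                 }
>>>>>             }, "ping-responder");
>>>>>             t.setDaemon(true);
>>>>>             t.start();
>>>>>         }
>>>>> 
>>>>>         void enqueue(PingPacket p) { pingQueue.add(p); }
>>>>>     }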
>>>> 
>>>> ZK is purposely architected this way; it is not a mistake/bug. It is a
>>>> fallacy for a highly available service to respond quickly to a
>>>> heartbeat when it cannot service regular requests in a timely fashion.
>>>> This is one of the main reasons why heartbeats are handled in this
>>>> way.
>>>> 
>>>> Patrick
>>>> 
>>>>>> Patrick
>>>>>> 
>>>>>>> It's about CommitProcessor thread queueing (on the leader).
>>>>>>> QueuedRequests goes up to 800, and so do commitedRequests and
>>>>>>> PendingRequestElapsedTime; PendingRequestElapsedTime
>>>>>>> goes up to 8.8 seconds during this flood.
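>>>>>>> 
>>>>>>> (if anyone wants to watch this while reproducing it: besides the JMX
>>>>>>> beans, the four-letter-word "stat" command reports min/avg/max latency
>>>>>>> and outstanding requests; the host name below is illustrative)
>>>>>>> 
>>>>>>>     echo stat | nc zk-leader 2181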
>>>>>>> 
>>>>>>> The easiest way to reproduce this scenario exactly is to
>>>>>>> 
>>>>>>> - suspend all client JVMs with a debugger, or
>>>>>>> - cause all client JVMs to OOME and produce heap dumps
>>>>>>> 
>>>>>>> in group B. All clients in group A will then fail to receive a
>>>>>>> ping response within 5 seconds.
>>>>>>> 
>>>>>>> We need to fix this as soon as possible.
>>>>>>> What we do as a workaround is raise the sessionTimeout to 40 sec.
>>>>>>> At least the clients in group A survive, but this increases
>>>>>>> our cluster failover time significantly.
>>>>>>> 
>>>>>>> Thank you, Patrick.
>>>>>>> 
>>>>>>> 
>>>>>>> ps. We actually push ping requests to FinalRequestProcessor as
>>>>>>>   soon as the packet identifies itself as a ping. No dice.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Apr 14, 2011, at 12:21 AM, Patrick Hunt wrote:
>>>>>>> 
>>>>>>>> Hi Chang, it sounds like you may have an issue with your cluster
>>>>>>>> environment/setup, or perhaps a resource (GC/mem) issue. Have you
>>>>>>>> looked through the troubleshooting guide?
>>>>>>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting
>>>>>>>> 
>>>>>>>> In particular, 1000 clients connecting should be fine; I've personally
>>>>>>>> seen clusters of 7-10 thousand clients. Keep in mind that each session
>>>>>>>> establishment is essentially a write (so the quorum is involved), and
>>>>>>>> what we typically see there is that the cluster configuration has
>>>>>>>> issues. 14 seconds for a ping response is huge and indicates one of
>>>>>>>> the following may be an underlying cause:
>>>>>>>> 
>>>>>>>> 1) are you running in a virtualized environment?
>>>>>>>> 2) are you co-locating other services on the same host(s) that make up
>>>>>>>> the ZK serving cluster?
>>>>>>>> 3) have you followed the admin guide's "things to avoid"?
>>>>>>>> http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
>>>>>>>> In particular, ensure that you are not swapping or going into GC
>>>>>>>> pause (both on the server and the client):
>>>>>>>> a) try turning on GC logging (flags sketched below) and ensure that you
>>>>>>>> are not going into GC pause; see the troubleshooting guide, as this is
>>>>>>>> the most common cause of high latency for the clients
>>>>>>>> b) ensure that you are not swapping
>>>>>>>> c) ensure that other processes are not causing log writing
>>>>>>>> (transactional logging) to be slow.
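>>>>>>>> 
>>>>>>>> For (a), a typical set of HotSpot flags to capture GC pauses looks like
>>>>>>>> the following (the log path is illustrative):
>>>>>>>> 
>>>>>>>>     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>>>>>>>>     -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/zookeeper/gc.log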
>>>>>>>> 
>>>>>>>> Patrick
>>>>>>>> 
>>>>>>>> On Wed, Apr 13, 2011 at 6:35 AM, Chang Song <tru64ufs@me.com> wrote:
>>>>>>>>> Hello, folks.
>>>>>>>>> 
>>>>>>>>> We have run into a very serious issue with ZooKeeper.
>>>>>>>>> Here's a brief scenario.
>>>>>>>>> 
>>>>>>>>> We have some ZooKeeper clients with a session timeout of 15 sec (thus a
>>>>>>>>> 5 sec ping interval); let's call these clients group A.
>>>>>>>>> 
>>>>>>>>> Now 1000 new clients (let's call these group B) start up at the same time
>>>>>>>>> trying to connect to a three-node ZK ensemble, creating a ZK createSession stampede.
>>>>>>>>> 
>>>>>>>>> Now almost all clients in group A are unable to exchange a ping within the
>>>>>>>>> session expiry time (15 sec). Thus the clients in group A drop out of the cluster.
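>>>>>>>>> 
>>>>>>>>> A rough sketch of the kind of connect storm described above (the connect
>>>>>>>>> string, thread count, and client count are illustrative only):
>>>>>>>>> 
>>>>>>>>>     import java.util.concurrent.ExecutorService;
>>>>>>>>>     import java.util.concurrent.Executors;
>>>>>>>>>     import org.apache.zookeeper.WatchedEvent;
>>>>>>>>>     import org.apache.zookeeper.Watcher;
>>>>>>>>>     import org.apache.zookeeper.ZooKeeper;
>>>>>>>>> 
>>>>>>>>>     public class SessionStampede {
>>>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>>>             final String connect = "zk1:2181,zk2:2181,zk3:2181";
>>>>>>>>>             ExecutorService pool = Executors.newFixedThreadPool(200);
>>>>>>>>>             for (int i = 0; i < 1000; i++) {
>>>>>>>>>                 pool.submit(new Runnable() {
>>>>>>>>>                     public void run() {
>>>>>>>>>                         try {
>>>>>>>>>                             // every new ZooKeeper handle opens a session,
>>>>>>>>>                             // which is a quorum write on the servers
>>>>>>>>>                             new ZooKeeper(connect, 15000, new Watcher() {
>>>>>>>>>                                 public void process(WatchedEvent event) { }
>>>>>>>>>                             });
>>>>>>>>>                         } catch (Exception e) {
>>>>>>>>>                             e.printStackTrace();
>>>>>>>>>                         }
>>>>>>>>>                     }
>>>>>>>>>                 });
>>>>>>>>>             }
>>>>>>>>>             pool.shutdown();
>>>>>>>>>         }
>>>>>>>>>     }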
>>>>>>>>> 
>>>>>>>>> We have looked into this issue a bit and found that session queue processing
>>>>>>>>> is mostly synchronous in nature. The latency between a ping request and its
>>>>>>>>> response ranges from 10ms up to 14 seconds during this login stampede.
>>>>>>>>> 
>>>>>>>>> Since the session timeout is a serious matter for our cluster, pings should
>>>>>>>>> be handled in a pseudo-realtime fashion.
>>>>>>>>> 
>>>>>>>>> I don't know exactly how the ping timeout policy works in the clients and the
>>>>>>>>> server, but clients failing to receive a ping response because of ZooKeeper
>>>>>>>>> session logins makes no sense to me.
>>>>>>>>> 
>>>>>>>>> Shouldn't we have a separate ping/heartbeat queue and thread?
>>>>>>>>> Or even multiple ping queues/threads to keep the heartbeat realtime?
>>>>>>>>> 
>>>>>>>>> This is a very serious issue with ZooKeeper for our mission-critical
>>>>>>>>> system. Could anyone look into this?
>>>>>>>>> 
>>>>>>>>> I will try to file a bug.
>>>>>>>>> 
>>>>>>>>> Thank you.
>>>>>>>>> 
>>>>>>>>> Chang
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> 

