zookeeper-user mailing list archives

From Benjamin Reed <br...@apache.org>
Subject Re: Serious problem processing hearbeat on login stampede
Date Thu, 14 Apr 2011 22:16:39 GMT
when you file the jira can you also note the logging level you are using?

thanx
ben

2011/4/14 Chang Song <tru64ufs@me.com>:
>
> Yes, Ben.
>
> If you read my emails carefully, I already said it is not the heartbeat;
> it is session establishment/closing that gets stampeded.
> Since the responses to all requests get delayed, heartbeats are delayed
> as well.
>
>
> You need to understand that most apps can tolerate delays in connect/close,
> but we cannot tolerate ping delays, since we use the ZK heartbeat timeout
> as our sole failure detection.
> We use a 15-second session timeout (5 sec for each ensemble), so
> important servers drop out of the cluster even when they are not
> malfunctioning. In some cases, this wreaks havoc on certain services.
>
>
> 1. 3.3.3 (latest)
>
> 2. We have a boot disk and usr disk.
>    But as I said, disk I/O is not the issue causing the 8-second delay.
>
> My team will file JIRA today, we'll have to discuss on JIRA ;)
>
> Thank you.
>
> Chang
>
>
>
>
> On 2011-04-15, at 2:59 AM, Benjamin Reed wrote:
>
>> chang,
>>
>> if the problem is on client startup, then it isn't a heartbeat
>> stampede, it is session establishment. the heartbeats are very
>> lightweight, so i can't imagine them causing any issues.
>>
>> the two key issues we need to know are: 1) the version of the server
>> you are running, and 2) if you are using a dedicated device for the
>> transaction log.
>>
>> ben
>>
>> 2011/4/14 Patrick Hunt <phunt@apache.org>:
>>> 2011/4/14 Chang Song <tru64ufs@me.com>:
>>>>> 2) regarding IO, if you run 'iostat -x 2' on the zk servers while your
>>>>> issue is happening, what's the %util of the disk? what's the iowait
>>>>> look like?
>>>>>
>>>>
>>>> Again, no I/O at all.   0%
>>>>
>>>
>>> This is simply not possible.
>>>
>>> Sessions are persistent. Each time a session is created, and each time
>>> it is closed, a transaction is written by the zk server to the data
>>> directory. Additionally log4j based logs are also being streamed to
>>> the disk. Each of these activities will cause disk IO that will show
>>> up on iostat.
>>>
>>>> Patrick, they do not log in and out continuously.
>>>> Maybe a couple of times a week, and before they push a new feature.
>>>> When this happens, clients in group A drop out of the cluster, which
>>>> causes problems for other, unrelated services.
>>>>
>>>
>>> Ok, good to know.
>>>
>>>>
>>>> It is not about the use case, because the ZK clients simply tried to
>>>> connect to the ZK ensemble. No use case applies. Many clients just log in
>>>> at the same time, expire at the same time, or close sessions at the same time.
>>>>
>>>
>>> As I mentioned, I've seen cluster sizes of 10,000 clients (10x what
>>> you report) that didn't have this issue. While bugs might be lurking,
>>> I've also worked with many teams deploying clusters (probably close to
>>> 100 by now), some of which had problems; the suggestions I'm making to
>>> you are based on that experience.
>>>
>>>> Heartbeats should be handled in an isolated queue and a
>>>> dedicated thread.  I don't think we need to keep strict ordering
>>>> of heartbeats, do we?
>>>
>>> ZK is purposely architected this way, it is not a mistake/bug. It is a
>>> fallacy for a highly available service to respond quickly to a
>>> heartbeat when it cannot service regular requests in a timely fashion.
>>> This is one of the main reasons why heartbeats are handled in this
>>> way.
>>>
>>> Patrick
>>>
>>>>> Patrick
>>>>>
>>>>>> It's about CommitProcessor thread queueing (in leader).
>>>>>> QueuedRequests goes up to 800, so does commitedRequests and
>>>>>> PendingRequestElapsedTime. PendingRequestElapsedTime
>>>>>> goes up to 8.8 seconds during this flood.
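
Note (illustrative sketch, not part of the quoted message): the queueing effect described above can be mimicked with a toy single-worker FIFO queue. The 800 queued requests and the roughly 8.8-second elapsed time come from the message; the per-request service times (11 ms per session request, 0.1 ms per ping) and all names in the code are hypothetical, chosen only so that 800 x 11 ms lands near 8.8 s. This is a simplification and does not model ZooKeeper's actual request processor chain.

```python
from collections import deque


def fifo_completion_times(requests, service_time_s):
    """Serve requests one at a time in arrival order.

    Returns a list of (kind, seconds_until_done) pairs, where the time
    counts from the moment the whole backlog was queued.
    """
    queue = deque(requests)
    clock = 0.0
    results = []
    while queue:
        kind = queue.popleft()
        clock += service_time_s[kind]  # every request waits for all earlier ones
        results.append((kind, clock))
    return results


# Hypothetical service times, not measured values.
service = {"createSession": 0.011, "ping": 0.0001}

# A ping that arrives behind 800 queued session requests.
backlog = ["createSession"] * 800 + ["ping"]
kind, latency = fifo_completion_times(backlog, service)[-1]
print(f"{kind} answered after {latency:.1f} s")

# The same ping with an empty queue ahead of it.
_, alone = fifo_completion_times(["ping"], service)[-1]
print(f"ping alone answered after {alone * 1000:.1f} ms")
```

The ping itself is cheap in this model; its latency is dominated by the work queued ahead of it, which is the behavior both sides of the thread are describing.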
>>>>>>
>>>>>> To reproduce this scenario exactly, the easiest way is to
>>>>>>
>>>>>> - suspend all client JVMs with a debugger
>>>>>> - cause all client JVMs to hit an OOME and create a heap dump
>>>>>>
>>>>>> in group B. All clients in group A will then be unable to receive
>>>>>> a ping response within 5 seconds.
>>>>>>
>>>>>> We need to fix this as soon as possible.
>>>>>> What we do as a workaround is to raise sessionTimeout to 40 sec.
>>>>>> At least the clients in group A survive. But this increases
>>>>>> our cluster failover time significantly.
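
Note (illustrative sketch, not part of the quoted message): the timing figures in this thread can be checked with simple arithmetic. The sketch assumes a simplified client model in which a ping is sent after one third of the session timeout has passed idle, and the client gives up on the connection when it has heard nothing for two thirds of the timeout, which leaves one third of the timeout for the ping round trip. That model matches the "15 sec timeout, 5 sec ping" figures quoted here, but it is an assumption, and the function names are hypothetical.

```python
def ping_budget_s(session_timeout_s: float) -> float:
    """Time available for a ping round trip under the assumed model.

    Assumption: ping sent after timeout/3 of idle time, connection
    abandoned after 2*timeout/3 of silence, so the budget is timeout/3.
    """
    return session_timeout_s / 3.0


def survives(session_timeout_s: float, ping_latency_s: float) -> bool:
    """True if a ping answered after ping_latency_s fits in the budget."""
    return ping_latency_s < ping_budget_s(session_timeout_s)


# 8.8 s is the peak elapsed time reported in the message above.
for timeout in (15, 40):
    budget = ping_budget_s(timeout)
    print(f"timeout={timeout}s budget={budget:.1f}s "
          f"survives 8.8s ping: {survives(timeout, 8.8)}")
```

Under this model the 15-second timeout leaves a 5-second budget, which an 8.8-second response exceeds, while the 40-second workaround leaves about 13.3 seconds. The cost is the one the message names: failure detection takes correspondingly longer.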
>>>>>>
>>>>>> Thank you, Patrick.
>>>>>>
>>>>>>
>>>>>> ps. We actually tried pushing ping requests to FinalRequestProcessor
>>>>>>    as soon as the packet identifies itself as a ping. No dice.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2011-04-14, at 12:21 AM, Patrick Hunt wrote:
>>>>>>
>>>>>>> Hi Chang, it sounds like you may have an issue with your cluster
>>>>>>> environment/setup, or perhaps a resource (GC/mem) issue. Have you
>>>>>>> looked through the troubleshooting guide?
>>>>>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting
>>>>>>>
>>>>>>> In particular 1000 clients connecting should be fine, I've personally
>>>>>>> seen clusters of 7-10 thousand clients. Keep in mind that each session
>>>>>>> establishment is essentially a write (so the quorum is involved) and
>>>>>>> what we typically see there is that the cluster configuration has
>>>>>>> issues. 14 seconds for a ping response is huge and indicates one of
>>>>>>> the following may be an underlying cause:
>>>>>>>
>>>>>>> 1) are you running in a virtualized environment?
>>>>>>> 2) are you co-locating other services on the same host(s) that make up
>>>>>>> the ZK serving cluster?
>>>>>>> 3) have you followed the admin guide's "things to avoid"?
>>>>>>> http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
>>>>>>> In particular ensuring that you are not swapping or going into gc
>>>>>>> pause (both on the server and the client)
>>>>>>> a) try turning on GC logging and ensure that you are not going into GC
>>>>>>> pause, see the troubleshooting guide, this is the most common cause of
>>>>>>> high latency for the clients
>>>>>>> b) ensure that you are not swapping
>>>>>>> c) ensure that other processes are not causing log writing
>>>>>>> (transactional logging) to be slow.
>>>>>>>
>>>>>>> Patrick
>>>>>>>
>>>>>>> On Wed, Apr 13, 2011 at 6:35 AM, Chang Song <tru64ufs@me.com> wrote:
>>>>>>>> Hello, folks.
>>>>>>>>
>>>>>>>> We have run into a very serious issue with Zookeeper.
>>>>>>>> Here's a brief scenario.
>>>>>>>>
>>>>>>>> We have some Zookeeper clients with a session timeout of 15 sec
>>>>>>>> (thus 5 sec ping); let's call these clients group A.
>>>>>>>>
>>>>>>>> Now 1000 new clients (let's call these group B) start up at the
>>>>>>>> same time, trying to connect to a three-node ZK ensemble and
>>>>>>>> creating a ZK createSession stampede.
>>>>>>>>
>>>>>>>> Now almost all clients in group A are unable to exchange pings
>>>>>>>> within the session expiry time (15 sec).
>>>>>>>> Thus clients in group A drop out of the cluster.
>>>>>>>>
>>>>>>>> We have looked into this issue a bit and found that session queue
>>>>>>>> processing is mostly synchronous.
>>>>>>>> Latency between ping request and response ranges from 10ms up to
>>>>>>>> 14 seconds during this login stampede.
>>>>>>>>
>>>>>>>> Since session timeout is a serious matter for our cluster, pings
>>>>>>>> should be handled in a pseudo-realtime fashion.
>>>>>>>>
>>>>>>>> I don't know exactly how the ping timeout policy works in clients
>>>>>>>> and server, but clients failing to receive ping responses because of
>>>>>>>> zookeeper session logins makes no sense to me.
>>>>>>>>
>>>>>>>> Shouldn't we have a separate ping/heartbeat queue and thread?
>>>>>>>> Or even multiple ping queues/threads to keep realtime heartbeat?
>>>>>>>>
>>>>>>>> This is a very serious issue with Zookeeper for our mission-critical
>>>>>>>> system. Could anyone look into this?
>>>>>>>>
>>>>>>>> I will try to file a bug.
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> Chang
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>
>
