Subject: Re: Serious problem processing heartbeat on login stampede
From: Benjamin Reed
Date: Thu, 14 Apr 2011 10:59:16 -0700
To: user@zookeeper.apache.org
Cc: Patrick Hunt, Chang Song, zookeeper-user@hadoop.apache.org

chang,

if the problem is on client startup, then it isn't the heartbeat
stampede, it is session establishment. the heartbeats are very light
weight, so i can't imagine them causing any issues. the two key issues
we need to know are: 1) the version of the server you are running, and
2) whether you are using a dedicated device for the transaction log.
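for reference, the log device is configured with dataLogDir in zoo.cfg;
if it isn't set, the transaction log shares dataDir (and usually the
same disk) with the snapshots. a minimal sketch, assuming a separately
mounted disk at /zk-txlog; the hostnames, ports and paths are
illustrative, not taken from your setup:

# minimal zoo.cfg sketch
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
# snapshots can live on the ordinary data disk
dataDir=/var/lib/zookeeper/data
# transaction log on its own device, so fsync latency is not
# affected by other writers (log4j logs, snapshots, etc.)
dataLogDir=/zk-txlog/zookeeper
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888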
ben

2011/4/14 Patrick Hunt :
> 2011/4/14 Chang Song :
>>> 2) regarding IO, if you run 'iostat -x 2' on the zk servers while your
>>> issue is happening, what's the %util of the disk? what's the iowait
>>> look like?
>>>
>>
>> Again, no I/O at all. 0%
>>
>
> This is simply not possible.
>
> Sessions are persistent. Each time a session is created, and each time
> it is closed, a transaction is written by the zk server to the data
> directory. Additionally, log4j-based logs are also being streamed to
> disk. Each of these activities will cause disk I/O that will show up
> in iostat.
>
>> Patrick, they are not continuously logging in and out.
>> Maybe a couple of times a week, and before they push a new feature.
>> When this happens, clients in group A drop out of the cluster, which
>> causes problems for other, unrelated services.
>>
>
> Ok, good to know.
>
>> It is not about the use case, because the ZK clients simply tried to
>> connect to the ZK ensemble. No use case applies. Just many clients
>> logging in at the same time, or expiring at the same time, or closing
>> sessions at the same time.
>>
>
> As I mentioned, I've seen cluster sizes of 10,000 clients (10x what
> you report) that didn't have this issue. While bugs might be lurking,
> I've also worked with many teams deploying clusters (probably close to
> 100 by now), some of which had problems; the suggestions I'm making to
> you are based on that experience.
>
>> Heartbeats should be handled in an isolated queue and a
>> dedicated thread. I don't think we need strict ordering
>> of heartbeats, do we?
>
> ZK is purposely architected this way; it is not a mistake/bug. It is a
> fallacy for a highly available service to respond quickly to a
> heartbeat when it cannot service regular requests in a timely fashion.
> This is one of the main reasons why heartbeats are handled in this
> way.
>
> Patrick
>
>>> Patrick
>>>
>>>> It's about CommitProcessor thread queueing (in the leader).
>>>> QueuedRequests goes up to 800, and so do CommittedRequests and
>>>> PendingRequestElapsedTime. PendingRequestElapsedTime goes up to
>>>> 8.8 seconds during this flood.
>>>>
>>>> To reproduce this scenario exactly, the easiest way is to
>>>>
>>>> - suspend all client JVMs with a debugger
>>>> - cause all client JVMs to OOME and produce a heap dump
>>>>
>>>> in group B. All clients in group A will then fail to receive a ping
>>>> response within 5 seconds.
>>>>
>>>> We need to fix this as soon as possible.
>>>> What we do as a workaround is to raise sessionTimeout to 40 sec.
>>>> At least the clients in group A survive. But this increases our
>>>> cluster failover time significantly.
>>>>
>>>> Thank you, Patrick.
>>>>
>>>> ps. We actually tried pushing ping requests to FinalRequestProcessor
>>>> as soon as the packet identifies itself as a ping. No dice.
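a quick note on the 40 sec workaround mentioned above: that is just the
sessionTimeout argument to the java client constructor, and the server
clamps it to its min/max session timeout (derived from tickTime unless
overridden). a minimal sketch; the connect string and the watcher body
are illustrative:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class LongTimeoutClient {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        // 40000 ms requested session timeout: sessions survive longer
        // stalls on the server side, at the cost of slower failure
        // detection (and therefore slower cluster failover).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 40000,
            new Watcher() {
                public void process(WatchedEvent event) {
                    switch (event.getState()) {
                    case SyncConnected:
                        connected.countDown();
                        break;
                    case Expired:
                        // the handle is dead; a new ZooKeeper instance
                        // has to be created to get a new session
                        System.err.println("session expired");
                        break;
                    default:
                        break;
                    }
                }
            });
        connected.await();
        System.out.println("negotiated timeout: "
            + zk.getSessionTimeout() + " ms");
        zk.close();
    }
}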
>>>>
>>>> On Apr 14, 2011, at 12:21 AM, Patrick Hunt wrote:
>>>>
>>>>> Hi Chang, it sounds like you may have an issue with your cluster
>>>>> environment/setup, or perhaps a resource (GC/mem) issue. Have you
>>>>> looked through the troubleshooting guide?
>>>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting
>>>>>
>>>>> In particular, 1000 clients connecting should be fine; I've personally
>>>>> seen clusters of 7-10 thousand clients. Keep in mind that each session
>>>>> establishment is essentially a write (so the quorum is involved), and
>>>>> what we typically see there is that the cluster configuration has
>>>>> issues. 14 seconds for a ping response is huge and indicates one of
>>>>> the following may be an underlying cause:
>>>>>
>>>>> 1) are you running in a virtualized environment?
>>>>> 2) are you co-locating other services on the same host(s) that make up
>>>>> the ZK serving cluster?
>>>>> 3) have you followed the admin guide's "things to avoid"?
>>>>> http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
>>>>> In particular, ensure that you are not swapping or going into GC
>>>>> pause (both on the server and the client):
>>>>> a) try turning on GC logging and ensure that you are not going into
>>>>> GC pause; see the troubleshooting guide, this is the most common
>>>>> cause of high latency for the clients
>>>>> b) ensure that you are not swapping
>>>>> c) ensure that other processes are not causing log writing
>>>>> (transactional logging) to be slow.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Wed, Apr 13, 2011 at 6:35 AM, Chang Song wrote:
>>>>>> Hello, folks.
>>>>>>
>>>>>> We have run into a very serious issue with ZooKeeper.
>>>>>> Here's a brief scenario.
>>>>>>
>>>>>> We have some ZooKeeper clients with a session timeout of 15 sec
>>>>>> (thus a 5 sec ping); let's call these clients group A.
>>>>>>
>>>>>> Now 1000 new clients (let's call these group B) start up at the
>>>>>> same time, trying to connect to a three-node ZK ensemble and
>>>>>> creating a ZK createSession stampede.
>>>>>>
>>>>>> Now almost all clients in group A are unable to exchange a ping
>>>>>> within the session expire time (15 sec). Thus clients in group A
>>>>>> drop out of the cluster.
>>>>>>
>>>>>> We have looked into this issue a bit and found that it is mostly
>>>>>> due to the synchronous nature of session queue processing. Latency
>>>>>> between ping request and response ranges from 10 ms up to 14
>>>>>> seconds during this login stampede.
>>>>>>
>>>>>> Since session timeout is a serious matter for our cluster, pings
>>>>>> should be handled in a pseudo-realtime fashion.
>>>>>>
>>>>>> I don't know exactly how the ping timeout policy works in the
>>>>>> clients and the server, but clients failing to receive ping
>>>>>> responses because of ZooKeeper session logins seems like nonsense
>>>>>> to me.
>>>>>>
>>>>>> Shouldn't we have a separate ping/heartbeat queue and thread?
>>>>>> Or even multiple ping queues/threads to keep heartbeats realtime?
>>>>>>
>>>>>> This is a very serious issue with ZooKeeper for our
>>>>>> mission-critical system. Could anyone look into this?
>>>>>>
>>>>>> I will try to file a bug.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> Chang
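and for anyone who wants to reproduce the stampede chang describes: the
group B side can be simulated by opening a large number of sessions at
once from one or more JVMs, while a separate long-lived "group A"
client watches its ping latency. a rough sketch of the group B side
only; the connect string, client count and pool size are illustrative:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionStampede {
    public static void main(String[] args) throws Exception {
        final String connectString = "zk1:2181,zk2:2181,zk3:2181";
        final int clients = 1000;   // "group B" size
        final CountDownLatch done = new CountDownLatch(clients);
        final List<ZooKeeper> handles =
            Collections.synchronizedList(new ArrayList<ZooKeeper>());
        // the pool size bounds how many sessions this JVM tries to
        // establish concurrently
        ExecutorService pool = Executors.newFixedThreadPool(100);

        for (int i = 0; i < clients; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    long t0 = System.currentTimeMillis();
                    try {
                        // every new handle negotiates a fresh session,
                        // i.e. a quorum write on the ensemble
                        ZooKeeper zk = new ZooKeeper(connectString, 15000,
                            new Watcher() {
                                public void process(WatchedEvent event) {
                                    // ignored for this sketch
                                }
                            });
                        handles.add(zk);
                        // a synchronous call blocks until the session is
                        // actually usable, so this measures establishment
                        // latency under the stampede
                        zk.exists("/", false);
                        System.out.println("session usable after "
                            + (System.currentTimeMillis() - t0) + " ms");
                    } catch (Exception e) {
                        e.printStackTrace();
                    } finally {
                        done.countDown();
                    }
                }
            });
        }

        done.await();
        for (ZooKeeper zk : handles) {
            zk.close();   // closing a session is another quorum write
        }
        pool.shutdown();
    }
}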