Subject: Re: Serious problem processing heartbeat on login stampede
From: Benjamin Reed
Date: Thu, 14 Apr 2011 10:59:16 -0700
To: user@zookeeper.apache.org
Cc: Patrick Hunt, Chang Song, zookeeper-user@hadoop.apache.org

chang,

if the problem is on client startup, then it isn't the heartbeat
stampede, it is session establishment. the heartbeats are very light
weight, so i can't imagine them causing any issues. the two key issues
we need to know are: 1) the version of the server you are running, and
2) whether you are using a dedicated device for the transaction log.
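for reference, the log device is configured with dataLogDir in zoo.cfg;
if it isn't set, the transaction log shares dataDir (and usually the
same disk) with the snapshots. a minimal sketch, assuming a separately
mounted disk at /zk-txlog; the hostnames, ports and paths are
illustrative, not taken from your setup:

# minimal zoo.cfg sketch
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
# snapshots can live on the ordinary data disk
dataDir=/var/lib/zookeeper/data
# transaction log on its own device, so fsync latency is not
# affected by other writers (log4j logs, snapshots, etc.)
dataLogDir=/zk-txlog/zookeeper
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888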
ben

2011/4/14 Patrick Hunt :
> 2011/4/14 Chang Song :
>>> 2) regarding IO, if you run 'iostat -x 2' on the zk servers while your
>>> issue is happening, what's the %util of the disk? what's the iowait
>>> look like?
>>>
>>
>> Again, no I/O at all. 0%
>>
>
> This is simply not possible.
>
> Sessions are persistent. Each time a session is created, and each time
> it is closed, a transaction is written by the zk server to the data
> directory. Additionally, log4j-based logs are also being streamed to
> disk. Each of these activities will cause disk I/O that will show up
> in iostat.
>
>> Patrick, they are not continuously logging in and out.
>> Maybe a couple of times a week, and before they push a new feature.
>> When this happens, clients in group A drop out of the cluster, which
>> causes problems for other, unrelated services.
>>
>
> Ok, good to know.
>
>> It is not about the use case, because the ZK clients simply tried to
>> connect to the ZK ensemble. No use case applies. Just many clients
>> logging in at the same time, or expiring at the same time, or closing
>> sessions at the same time.
>>
>
> As I mentioned, I've seen cluster sizes of 10,000 clients (10x what
> you report) that didn't have this issue. While bugs might be lurking,
> I've also worked with many teams deploying clusters (probably close to
> 100 by now), some of which had problems; the suggestions I'm making to
> you are based on that experience.
>
>> Heartbeats should be handled in an isolated queue and a
>> dedicated thread. I don't think we need strict ordering
>> of heartbeats, do we?
>
> ZK is purposely architected this way; it is not a mistake/bug. It is a
> fallacy for a highly available service to respond quickly to a
> heartbeat when it cannot service regular requests in a timely fashion.
> This is one of the main reasons why heartbeats are handled in this
> way.
>
> Patrick
>
>>> Patrick
>>>
>>>> It's about CommitProcessor thread queueing (in the leader).
>>>> QueuedRequests goes up to 800, and so do CommittedRequests and
>>>> PendingRequestElapsedTime. PendingRequestElapsedTime goes up to
>>>> 8.8 seconds during this flood.
>>>>
>>>> To reproduce this scenario exactly, the easiest way is to
>>>>
>>>> - suspend all client JVMs with a debugger
>>>> - cause all client JVMs to OOME and produce a heap dump
>>>>
>>>> in group B. All clients in group A will then fail to receive a ping
>>>> response within 5 seconds.
>>>>
>>>> We need to fix this as soon as possible.
>>>> What we do as a workaround is to raise sessionTimeout to 40 sec.
>>>> At least the clients in group A survive. But this increases our
>>>> cluster failover time significantly.
>>>>
>>>> Thank you, Patrick.
>>>>
>>>> ps. We actually tried pushing ping requests to FinalRequestProcessor
>>>> as soon as the packet identifies itself as a ping. No dice.
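a quick note on the 40 sec workaround mentioned above: that is just the
sessionTimeout argument to the java client constructor, and the server
clamps it to its min/max session timeout (derived from tickTime unless
overridden). a minimal sketch; the connect string and the watcher body
are illustrative:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class LongTimeoutClient {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        // 40000 ms requested session timeout: sessions survive longer
        // stalls on the server side, at the cost of slower failure
        // detection (and therefore slower cluster failover).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 40000,
            new Watcher() {
                public void process(WatchedEvent event) {
                    switch (event.getState()) {
                    case SyncConnected:
                        connected.countDown();
                        break;
                    case Expired:
                        // the handle is dead; a new ZooKeeper instance
                        // has to be created to get a new session
                        System.err.println("session expired");
                        break;
                    default:
                        break;
                    }
                }
            });
        connected.await();
        System.out.println("negotiated timeout: "
            + zk.getSessionTimeout() + " ms");
        zk.close();
    }
}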
>>>>
>>>> On Apr 14, 2011, at 12:21 AM, Patrick Hunt wrote:
>>>>
>>>>> Hi Chang, it sounds like you may have an issue with your cluster
>>>>> environment/setup, or perhaps a resource (GC/mem) issue. Have you
>>>>> looked through the troubleshooting guide?
>>>>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting
>>>>>
>>>>> In particular, 1000 clients connecting should be fine; I've personally
>>>>> seen clusters of 7-10 thousand clients. Keep in mind that each session
>>>>> establishment is essentially a write (so the quorum is involved), and
>>>>> what we typically see there is that the cluster configuration has
>>>>> issues. 14 seconds for a ping response is huge and indicates one of
>>>>> the following may be an underlying cause:
>>>>>
>>>>> 1) are you running in a virtualized environment?
>>>>> 2) are you co-locating other services on the same host(s) that make up
>>>>> the ZK serving cluster?
>>>>> 3) have you followed the admin guide's "things to avoid"?
>>>>> http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
>>>>> In particular, ensure that you are not swapping or going into GC
>>>>> pause (both on the server and the client):
>>>>> a) try turning on GC logging and ensure that you are not going into
>>>>> GC pause; see the troubleshooting guide, this is the most common
>>>>> cause of high latency for the clients
>>>>> b) ensure that you are not swapping
>>>>> c) ensure that other processes are not causing log writing
>>>>> (transactional logging) to be slow.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Wed, Apr 13, 2011 at 6:35 AM, Chang Song wrote:
>>>>>> Hello, folks.
>>>>>>
>>>>>> We have run into a very serious issue with ZooKeeper.
>>>>>> Here's a brief scenario.
>>>>>>
>>>>>> We have some ZooKeeper clients with a session timeout of 15 sec
>>>>>> (thus a 5 sec ping); let's call these clients group A.
>>>>>>
>>>>>> Now 1000 new clients (let's call these group B) start up at the
>>>>>> same time, trying to connect to a three-node ZK ensemble and
>>>>>> creating a ZK createSession stampede.
>>>>>>
>>>>>> Now almost all clients in group A are unable to exchange a ping
>>>>>> within the session expire time (15 sec). Thus clients in group A
>>>>>> drop out of the cluster.
>>>>>>
>>>>>> We have looked into this issue a bit and found that it is mostly
>>>>>> due to the synchronous nature of session queue processing. Latency
>>>>>> between ping request and response ranges from 10 ms up to 14
>>>>>> seconds during this login stampede.
>>>>>>
>>>>>> Since session timeout is a serious matter for our cluster, pings
>>>>>> should be handled in a pseudo-realtime fashion.
>>>>>>
>>>>>> I don't know exactly how the ping timeout policy works in the
>>>>>> clients and the server, but clients failing to receive ping
>>>>>> responses because of ZooKeeper session logins seems like nonsense
>>>>>> to me.
>>>>>>
>>>>>> Shouldn't we have a separate ping/heartbeat queue and thread?
>>>>>> Or even multiple ping queues/threads to keep heartbeats realtime?
>>>>>>
>>>>>> This is a very serious issue with ZooKeeper for our
>>>>>> mission-critical system. Could anyone look into this?
>>>>>>
>>>>>> I will try to file a bug.
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> Chang
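and for anyone who wants to reproduce the stampede chang describes: the
group B side can be simulated by opening a large number of sessions at
once from one or more JVMs, while a separate long-lived "group A"
client watches its ping latency. a rough sketch of the group B side
only; the connect string, client count and pool size are illustrative:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionStampede {
    public static void main(String[] args) throws Exception {
        final String connectString = "zk1:2181,zk2:2181,zk3:2181";
        final int clients = 1000;   // "group B" size
        final CountDownLatch done = new CountDownLatch(clients);
        final List<ZooKeeper> handles =
            Collections.synchronizedList(new ArrayList<ZooKeeper>());
        // the pool size bounds how many sessions this JVM tries to
        // establish concurrently
        ExecutorService pool = Executors.newFixedThreadPool(100);

        for (int i = 0; i < clients; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    long t0 = System.currentTimeMillis();
                    try {
                        // every new handle negotiates a fresh session,
                        // i.e. a quorum write on the ensemble
                        ZooKeeper zk = new ZooKeeper(connectString, 15000,
                            new Watcher() {
                                public void process(WatchedEvent event) {
                                    // ignored for this sketch
                                }
                            });
                        handles.add(zk);
                        // a synchronous call blocks until the session is
                        // actually usable, so this measures establishment
                        // latency under the stampede
                        zk.exists("/", false);
                        System.out.println("session usable after "
                            + (System.currentTimeMillis() - t0) + " ms");
                    } catch (Exception e) {
                        e.printStackTrace();
                    } finally {
                        done.countDown();
                    }
                }
            });
        }

        done.await();
        for (ZooKeeper zk : handles) {
            zk.close();   // closing a session is another quorum write
        }
        pool.shutdown();
    }
}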