Yes, 965 seconds is huge.
The times I've seen such huge latencies are (in order of frequency):
1) when the Java process GCs, swaps, or both (see the quick checks below)
and/or
2) disk utilization on the ZK server is high
and/or
3) under-provisioned virtual machines (i.e. VMware)
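
Re 1) a quick way to check on the server host, assuming you know the ZK
server's pid (a minimal sketch using stock Linux/JDK tools; the
<ZK_SERVER_PID> placeholder is yours to fill in):

  # watch swap activity; sustained nonzero si/so means the box is swapping
  vmstat 3

  # sample JVM GC activity once per second (jstat ships with the JDK)
  jstat -gcutil <ZK_SERVER_PID> 1000
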
Re 2) in some cases we've seen users running multiple IO-heavy processes
on the same host as the ZK server; HDFS datanodes and/or HBase region
servers are two common examples. In those cases I've monitored "sudo
iostat -x 3", specifically the "%util" column, and when this case is
occurring I've seen heavy utilization concurrent with the max latency
spiking up. ZK servers must sync the transaction log to disk on every
write operation, so maxed-out disk utilization (especially on an ext3
filesystem) will cause a spike in max latency while the server waits for
the sync to complete.
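
For example (abbreviated output; the device name and numbers here are
made up for illustration):

  $ sudo iostat -x 3
  Device:   ...   %util
  sda       ...   99.20

If %util sits pinned near 100% on the device holding the ZK data
directory at the same moment the max latency jumps, the disk is the
bottleneck.
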
I notice you have ~1 million writes and ~1 million reads according to
the current server stats. It seems you are doing 1 read for every write
(a relatively low read-to-write ratio... not extremely so, but something
to consider).
You might reset the latency statistics using the "srst" four-letter word
http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands
then monitor the max latency, the GC history, the swap activity of the
Java process, and especially the disk utilization. You can correlate a
jump in latency against these (GC, swap, %util, etc.), for example as
sketched below.
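
A minimal sketch (assumes the server is listening on localhost:2181 and
that nc is installed):

  # reset the server's latency statistics
  echo srst | nc localhost 2181

  # then poll the stats and watch for the max latency to jump
  while true; do
    echo stat | nc localhost 2181 | grep Latency
    sleep 5
  done

Run this alongside "sudo iostat -x 3", vmstat, and your GC log so you
can line a jump up against disk, swap, and GC activity.
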
useful tool: http://github.com/phunt/zktop
Given that you've ruled out GC (the most common cause), disk utilization
would be the next thing to check.
Patrick
On 06/15/2010 11:20 AM, Jordan Zimmerman wrote:
> Yes - the session drop happened again. I did the stat. The max latency
> is huge (I assume that's in ms).
>
> Zookeeper version: 3.3.0-925362, built on 03/19/2010 18:38 GMT
> Clients:
> /10.243.14.179:57300[1](queued=0,recved=0,sent=0)
> /207.111.236.2:51493[1](queued=0,recved=1,sent=0)
> /10.243.13.191:44414[1](queued=0,recved=0,sent=0)
> /10.243.13.191:44415[1](queued=0,recved=0,sent=0)
> /10.244.135.100:58965[1](queued=0,recved=0,sent=0)
> /10.243.15.192:45057[1](queued=0,recved=0,sent=0)
>
> Latency min/avg/max: 0/6/965759
> Received: 2203048
> Sent: 2202979
> Outstanding: 0
> Zxid: 0xef2b2
> Mode: standalone
> Node count: 347
>
>
> On Jun 15, 2010, at 11:02 AM, Ted Dunning wrote:
>
>> Jordan,
>>
>> Good step to get this info.
>>
>> I have to ask, did you have your disconnect problem last night as well?
>> (just checking)
>>
>> What does the stat command on ZK give you for each server?
>>
> >> On Tue, Jun 15, 2010 at 10:33 AM, Jordan Zimmerman
> >> <jzimmerman@proofpoint.com> wrote:
>>
>>> More on this...
>>>
> >>> I ran last night with verbose GC on our client. I analyzed the GC log in
> >>> GChisto and 99% of the GCs are 1 or 2 ms. The longest GC is 30 ms. On the
> >>> ZooKeeper server side, the longest GC is 130 ms. So, I submit, GC is not the
> >>> problem. NOTE we're running on Amazon EC2.
>>>
>>>
>