hadoop-zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: Debugging help for SessionExpiredException
Date Tue, 15 Jun 2010 19:30:09 GMT
Yes, 965 seconds is huge.

The times I've seen such huge latencies are (in order of frequency seen):

1) when the java process gc's, swaps, or both

and/or

2) disk utilization on the ZK server is high

and/or

3) under-provisioned virtual machines (e.g. VMware)

Re 2): in some cases we've seen users running multiple I/O-heavy 
processes on the same host as the ZK server; HDFS data nodes and/or 
HBase region servers are two common examples. In those cases I've 
monitored "sudo iostat -x 3", specifically the "%util" column, and when 
this is happening I've seen heavy utilization concurrent with the max 
latency spiking up. ZK servers need to sync a file to disk on every 
write operation, so maxed-out disk utilization (esp. on ext3 
filesystems) will cause a spike in max latency as the server has to 
wait for the sync to complete.
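A quick way to watch for this on the ZK server host (the 3-second 
interval is just an example; iostat comes from the sysstat package):

    sudo iostat -x 3
    # watch the %util column for the device holding the ZK data /
    # transaction log directory; sustained values near 100% will
    # generally line up with the max latency spikes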

I notice you have ~1 million writes and ~1 million reads according to 
the current server stats. It seems you are doing roughly 1 read for 
every write, so a relatively low read-to-write ratio for ZK (not 
extremely so, but something to consider).

You might reset the latency stats using the "srst" four-letter word 
http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands
then monitor the max latency, the GC history, the "swappiness" of the 
java process, and especially the disk utilization. You can correlate a 
jump in latency against each of these (gc, swap, %util, etc.).
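For example, assuming the default client port of 2181 and a server 
running locally (adjust host/port and pid for your setup):

    echo srst | nc localhost 2181   # reset the server's latency/request counters
    echo stat | nc localhost 2181   # then re-check min/avg/max latency periodically

    jstat -gcutil <pid> 5000        # GC history of the java process, sampled every 5s
                                    # (<pid> = the ZK server or client java pid)
    vmstat 5                        # swap activity, watch the si/so columns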

useful tool: http://github.com/phunt/zktop
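If I remember right the invocation is something like (the server list 
here is just an example, point it at your own ensemble):

    ./zktop.py --servers "localhost:2181"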

Given that you've ruled out GC (the most common cause), disk 
utilization would be the next thing to check.

Patrick

On 06/15/2010 11:20 AM, Jordan Zimmerman wrote:
> Yes - the session drop happened again. I did the stat. The max latency is huge (I assume that's in ms).
>
> Zookeeper version: 3.3.0-925362, built on 03/19/2010 18:38 GMT
> Clients:
>   /10.243.14.179:57300[1](queued=0,recved=0,sent=0)
>   /207.111.236.2:51493[1](queued=0,recved=1,sent=0)
>   /10.243.13.191:44414[1](queued=0,recved=0,sent=0)
>   /10.243.13.191:44415[1](queued=0,recved=0,sent=0)
>   /10.244.135.100:58965[1](queued=0,recved=0,sent=0)
>   /10.243.15.192:45057[1](queued=0,recved=0,sent=0)
>
> Latency min/avg/max: 0/6/965759
> Received: 2203048
> Sent: 2202979
> Outstanding: 0
> Zxid: 0xef2b2
> Mode: standalone
> Node count: 347
>
>
> On Jun 15, 2010, at 11:02 AM, Ted Dunning wrote:
>
>> Jordan,
>>
>> Good step to get this info.
>>
>> I have to ask, did you have your disconnect problem last night as well?
>> (just checking)
>>
>> What does the stat command on ZK give you for each server?
>>
>> On Tue, Jun 15, 2010 at 10:33 AM, Jordan Zimmerman <jzimmerman@proofpoint.com> wrote:
>>
>>> More on this...
>>>
>>> I ran last night with verbose GC on our client. I analyzed the GC log in
>>> gchisto and 99% of the GCs are 1 or 2 ms. The longest gc is 30 ms. On the
>>> Zookeeper server side, the longest gc is 130 ms. So, I submit, GC is not the
>>> problem. NOTE we're running on Amazon EC2.
>>>
>>>
>
