hadoop-zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Flavio Junqueira <...@yahoo-inc.com>
Subject Re: Zookeeper outage recap & questions
Date Thu, 01 Jul 2010 06:29:41 GMT
Hi Travis, Do you think it would be possible for you to open a jira  
and upload your logs?

Thanks,
-Flavio

On Jul 1, 2010, at 8:13 AM, Travis Crawford wrote:

> Hey zookeepers -
>
> We just experienced a total zookeeper outage, and here's a quick
> post-mortem of the issue, and some questions about preventing it going
> forward. Quick overview of the setup:
>
> - RHEL5 2.6.18 kernel
> - Zookeeper 3.3.0
> - ulimit raised to 65k files
> - 3 cluster members
> - 4-5k connections in steady-state
> - Primarily C and python clients, plus some java
>
> In chronological order, the issue manifested itself as alert about RW
> tests failing. Logs were full of too many files errors, and the output
> of netstat showed lots of CLOSE_WAIT and SYN_RECV sockets. CPU was
> 100%. Application logs showed lots of connection timeouts. This
> suggests an event happened that caused applications to dogpile on
> Zookeeper, and eventually the CLOSE_WAIT timeout caused file handles
> to run out and basically game over.
>
> I looked through lots of logs (clients+servers) and did not see a
> clear indication of what happened. Graphs show a sudden decrease in
> network traffic when the outage began, zookeeper goes cpu bound, and
> runs our of file descriptors.
>
> Clients are primarily a couple thousand C clients using default
> connection parameters, and a couple thousand python clients using
> default connection parameters.
>
> Digging through Jira we see two issues that probably contributed to  
> this outage:
>
>    https://issues.apache.org/jira/browse/ZOOKEEPER-662
>    https://issues.apache.org/jira/browse/ZOOKEEPER-517
>
> Both are tagged for the 3.4.0 release. Anyone know if that's still the
> case, and when 3.4.0 is roughly scheduled to ship?
>
> Thanks!
> Travis

flavio
junqueira

research scientist

fpj@yahoo-inc.com
direct +34 93-183-8828

avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300    fax (408) 349 3301




Mime
View raw message