From: Travis Crawford
To: zookeeper-user@hadoop.apache.org
Subject: Zookeeper outage recap & questions
Date: Wed, 30 Jun 2010 23:13:53 -0700

Hey zookeepers -

We just experienced a total zookeeper outage. Here's a quick post-mortem of the issue, plus some questions about preventing it going forward.

Quick overview of the setup:

- RHEL5 2.6.18 kernel
- Zookeeper 3.3.0
- ulimit raised to 65k files
- 3 cluster members
- 4-5k connections in steady-state
- Primarily C and python clients, plus some java

In chronological order, the issue first showed up as alerts about RW tests failing. Logs were full of "too many open files" errors, and the output of netstat showed lots of CLOSE_WAIT and SYN_RECV sockets. CPU was at 100%. Application logs showed lots of connection timeouts. This suggests an event happened that caused applications to dogpile on Zookeeper, and eventually the pile-up of sockets in CLOSE_WAIT caused file handles to run out, at which point it was basically game over.

I looked through lots of logs (clients + servers) and did not see a clear indication of what happened. Graphs show a sudden decrease in network traffic when the outage began, then zookeeper goes CPU bound and runs out of file descriptors.

Clients are primarily a couple thousand C clients using default connection parameters, and a couple thousand python clients using default connection parameters.
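
For anyone who wants to watch for the same symptom, here is a rough sketch of tallying socket states from netstat output. It is illustrative only: the function name and the sample text are made up, and it assumes the Linux `netstat -tan` column layout, where the state is the sixth field (extra flags such as `-p` add columns but leave the state in place).

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Tally TCP connection states from `netstat -tan`-style text.

    Assumed column layout (Linux net-tools):
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and non-TCP lines; "tcp6" also starts with "tcp".
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Hypothetical sample; addresses and ports are invented for illustration.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.1:2181           10.0.0.7:51234          ESTABLISHED
tcp        1      0 10.0.0.1:2181           10.0.0.8:40022          CLOSE_WAIT
tcp        0      0 10.0.0.1:2181           10.0.0.9:40100          SYN_RECV
tcp        1      0 10.0.0.1:2181           10.0.0.8:40023          CLOSE_WAIT
"""

if __name__ == "__main__":
    for state, n in count_tcp_states(SAMPLE).most_common():
        print(state, n)
```

In practice the text would come from running netstat on each server; graphing the CLOSE_WAIT count over time should show file handles heading toward the ulimit before they run out.
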

Digging through Jira, we see two issues that probably contributed to this outage:

https://issues.apache.org/jira/browse/ZOOKEEPER-662
https://issues.apache.org/jira/browse/ZOOKEEPER-517

Both are tagged for the 3.4.0 release. Anyone know if that's still the case, and when 3.4.0 is roughly scheduled to ship?

Thanks!
Travis