Date: Fri, 15 Sep 2017 17:09:01 +0000 (UTC)
From: "Yicheng Fang (JIRA)"
To: dev@zookeeper.apache.org
Subject: [jira] [Issue Comment Deleted] (ZOOKEEPER-2899) Zookeeper not receiving packets after ZXID overflows

[ https://issues.apache.org/jira/browse/ZOOKEEPER-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yicheng Fang updated ZOOKEEPER-2899:
------------------------------------
Comment: was deleted (was: Aggregated log of the 5 node ensemble, minus noisy connection logs)

> Zookeeper not receiving packets after ZXID overflows
> ----------------------------------------------------
>
> Key: ZOOKEEPER-2899
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2899
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.4.5
> Environment: 5 host ensemble, 1500+ client connections each, 300K+ nodes
> OS: Ubuntu precise
> JAVA 7
> JuniperQFX510048T NIC, 10000Mb/s, ixgbe driver
> 6 core Intel(R)_Xeon(R)_CPU_E5-2620_v3_@_2.40GHz
> 4 HDD 600G each
> Reporter: Yicheng Fang
> Attachments: image12.png, image13.png, zk_20170309_wo_noise.log
>
>
> ZK was used with Kafka (version 0.10.0) for coordination. We had a lot of Kafka consumers writing consumption offsets to ZK.
> We observed the issue two times within the last year.
> Each time after ZXID overflowed, ZK was not receiving packets even though leader election looked successful from the logs and the ZK servers were up. As a result, the whole Kafka system came to a halt.
> In an attempt to reproduce (and hopefully fix) the issue, I set up test ZK and Kafka clusters and fed them production-like test traffic. Although I was not really able to reproduce the issue, I did see that the Kafka consumers, which used ZK clients, essentially DoSed the ensemble, filling up the `submittedRequests` queue in `PrepRequestProcessor` and causing read latencies of 100ms or more.
> More details are included in the comments.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
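
For readers unfamiliar with the term, the "ZXID overflow" in the report refers to ZooKeeper's 64-bit transaction ID, whose high 32 bits hold the leader epoch and whose low 32 bits hold a per-epoch counter. Once the counter is exhausted, a new epoch (and therefore a leader election, as the report observes) is needed. The sketch below only illustrates that bit layout; the helper names are hypothetical and are not ZooKeeper APIs, and it does not attempt to model the failure described in the issue.

```python
from typing import Tuple

# Illustrative sketch only: these helpers are hypothetical, not ZooKeeper APIs.
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1  # 0xffffffff


def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch (high 32 bits) and a counter (low 32 bits) into one ID."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Return (epoch, counter) for a packed ID."""
    return zxid >> COUNTER_BITS, zxid & COUNTER_MASK


def counter_exhausted(zxid: int) -> bool:
    """True when the low 32 bits cannot grow without spilling into the epoch."""
    return (zxid & COUNTER_MASK) == COUNTER_MASK


# The last ID that epoch 5 can issue.
last = make_zxid(epoch=5, counter=COUNTER_MASK)
print(split_zxid(last))         # (5, 4294967295)
print(counter_exhausted(last))  # True
# A naive increment would carry into the epoch bits.
print(split_zxid(last + 1))     # (6, 0)
```

With heavy write traffic, such as many Kafka consumers committing offsets, the 2^32 counter values of a single epoch can be used up, which is the condition the report calls an overflow.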