Date: Fri, 15 Sep 2017 17:09:01 +0000 (UTC)
From: "Yicheng Fang (JIRA)"
To: dev@zookeeper.apache.org
Subject: [jira] [Updated] (ZOOKEEPER-2899) Zookeeper not receiving packets after ZXID overflows

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yicheng Fang updated ZOOKEEPER-2899:
------------------------------------
    Attachment: zk_20170309_wo_noise.log

Aggregated log of the 5 node ensemble, minus noisy connection logs

> Zookeeper not receiving packets after ZXID overflows
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-2899
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2899
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.5
>         Environment: 5 host ensemble, 1500+ client connections each, 300K+ nodes
> OS: Ubuntu precise
> JAVA 7
> JuniperQFX510048T NIC, 10000Mb/s, ixgbe driver
> 6 core Intel(R)_Xeon(R)_CPU_E5-2620_v3_@_2.40GHz
> 4 HDD 600G each
>            Reporter: Yicheng Fang
>         Attachments: image12.png, image13.png, zk_20170309_wo_noise.log
>
>
> ZK was used with Kafka (version 0.10.0) for coordination. We had a lot of Kafka consumers writing consumption offsets to ZK.
> We observed the issue two times within the last year.
> Each time after the ZXID overflowed, ZK was not receiving packets, even though leader election looked successful in the logs and the ZK servers were up. As a result, the whole Kafka system came to a halt.
> In an attempt to reproduce (and hopefully fix) the issue, I set up test ZK and Kafka clusters and fed them production-like test traffic. Though I was not really able to reproduce the issue, I did see that the Kafka consumers, which used ZK clients, essentially DOSed the ensemble, filling up the `submittedRequests` queue in `PrepRequestProcessor` and causing read latencies of 100ms and more.
> More details are included in the comments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
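
For readers unfamiliar with the term: a zxid is a 64-bit value whose high 32 bits hold the leader epoch and whose low 32 bits hold a per-epoch counter. "ZXID overflow" in this report presumably means that the counter half was exhausted, which is the point at which a new leader election is needed to start a fresh epoch. The sketch below only illustrates that layout. The class and method names are hypothetical and are not taken from the ZooKeeper source.

```java
// Hypothetical helper for illustration only; not ZooKeeper's own code.
// Assumes the documented zxid layout: high 32 bits = epoch, low 32 bits = counter.
final class ZxidSketch {
    static final long COUNTER_MASK = 0xffffffffL;

    // Epoch is stored in the high 32 bits.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Counter is stored in the low 32 bits.
    static long counterOf(long zxid) {
        return zxid & COUNTER_MASK;
    }

    // Combine an epoch and a counter into a single 64-bit zxid.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & COUNTER_MASK);
    }

    // True when the counter half has no room left for another transaction.
    static boolean counterExhausted(long zxid) {
        return counterOf(zxid) == COUNTER_MASK;
    }

    public static void main(String[] args) {
        long zxid = makeZxid(5, 0xfffffffeL);
        System.out.println(counterExhausted(zxid));     // false
        System.out.println(counterExhausted(zxid + 1)); // true

        // After a new election the epoch is bumped and the counter restarts at 0.
        long next = makeZxid(epochOf(zxid) + 1, 0);
        System.out.printf("0x%x%n", next);              // 0x600000000
    }
}
```

At a sustained write rate, such as many consumers committing offsets, the 32-bit counter can run out within a single epoch, which is consistent with the issue being seen only a couple of times per year.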