Mailing-List: contact dev-help@zookeeper.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@zookeeper.apache.org
Date: Fri, 15 Sep 2017 17:09:01 +0000 (UTC)
From: "Yicheng Fang (JIRA)" <jira@apache.org>
To: dev@zookeeper.apache.org
Message-ID: <JIRA.13102441.1505437219000.124854.1505495341056@Atlassian.JIRA>
In-Reply-To: <JIRA.13102441.1505437219000@Atlassian.JIRA>
References: <JIRA.13102441.1505437219000@Atlassian.JIRA> <JIRA.13102441.1505437219741@jira-lw-us.apache.org>
Subject: [jira] [Updated] (ZOOKEEPER-2899) Zookeeper not receiving packets
 after ZXID overflows
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Fri, 15 Sep 2017 17:09:06 -0000


     [ https://issues.apache.org/jira/browse/ZOOKEEPER-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yicheng Fang updated ZOOKEEPER-2899:
------------------------------------
    Attachment: zk_20170309_wo_noise.log

Aggregated log of the 5-node ensemble, with the noisy connection logs removed

> Zookeeper not receiving packets after ZXID overflows
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-2899
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2899
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.5
>         Environment: 5 host ensemble, 1500+ client connections each, 300K+ nodes
> OS: Ubuntu precise
> JAVA 7
> JuniperQFX510048T NIC, 10000Mb/s, ixgbe driver
> 6 core Intel(R)_Xeon(R)_CPU_E5-2620_v3_@_2.40GHz
> 4 HDD 600G each 
>            Reporter: Yicheng Fang
>         Attachments: image12.png, image13.png, zk_20170309_wo_noise.log
>
>
> ZK was used with Kafka (version 0.10.0) for coordination. We had a lot of Kafka consumers writing consumption offsets to ZK.
> We observed the issue twice within the last year. Each time, after the ZXID overflowed, ZK stopped receiving packets even though leader election looked successful in the logs and the ZK servers were up. As a result, the whole Kafka system came to a halt.
> In an attempt to reproduce (and hopefully fix) the issue, I set up test ZK and Kafka clusters and fed them production-like test traffic. Although I was not able to reproduce the issue, I did see that the Kafka consumers, which used ZK clients, essentially DoSed the ensemble, filling up `submittedRequests` in `PrepRequestProcessor` and pushing even read latencies above 100 ms.
> More details are included in the comments.
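
For readers unfamiliar with what "ZXID overflows" means here: a ZooKeeper zxid is a 64-bit value whose high 32 bits hold the leader epoch and whose low 32 bits hold a per-epoch transaction counter. When the counter is used up, the ensemble has to move to a new epoch, which as far as I understand involves a fresh leader election. The sketch below only illustrates that bit layout; the class and method names (`ZxidSketch`, `epochOf`, `counterOf`, `makeZxid`, `counterExhausted`) are made up for this note and are not ZooKeeper's own code.

```java
// Illustrative only: a hypothetical helper showing the zxid bit layout.
// High 32 bits = leader epoch, low 32 bits = per-epoch transaction counter.
final class ZxidSketch {
    private static final long COUNTER_MASK = 0xffffffffL;

    // Epoch is stored in the upper 32 bits (unsigned shift avoids sign extension).
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Counter is stored in the lower 32 bits.
    static long counterOf(long zxid) {
        return zxid & COUNTER_MASK;
    }

    // Combine an epoch and a counter into a single 64-bit zxid.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & COUNTER_MASK);
    }

    // True when the counter has no room left in the current epoch.
    static boolean counterExhausted(long zxid) {
        return counterOf(zxid) == COUNTER_MASK;
    }

    public static void main(String[] args) {
        long last = makeZxid(5L, COUNTER_MASK);
        System.out.printf("zxid=0x%x epoch=%d counter=%d exhausted=%b%n",
                last, epochOf(last), counterOf(last), counterExhausted(last));
    }
}
```

With consumers committing offsets at a high rate, the 32-bit counter (about 4.29 billion transactions per epoch) can be used up within months, which fits the two occurrences in a year reported above.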


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)