Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id C7375200CCC for ; Fri, 21 Jul 2017 10:53:08 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id C5E5A16C64C; Fri, 21 Jul 2017 08:53:08 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 17C5816C648 for ; Fri, 21 Jul 2017 10:53:07 +0200 (CEST) Received: (qmail 19967 invoked by uid 500); 21 Jul 2017 08:53:06 -0000 Mailing-List: contact dev-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@zookeeper.apache.org Delivered-To: mailing list dev@zookeeper.apache.org Received: (qmail 19954 invoked by uid 99); 21 Jul 2017 08:53:05 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 21 Jul 2017 08:53:05 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id EBFACE0262; Fri, 21 Jul 2017 08:53:04 +0000 (UTC) From: CheneySun To: dev@zookeeper.apache.org Reply-To: dev@zookeeper.apache.org References: In-Reply-To: Subject: [GitHub] zookeeper issue #312: ZOOKEEPER-1669: Operations to server will be timed-out... Content-Type: text/plain Message-Id: <20170721085304.EBFACE0262@git1-us-west.apache.org> Date: Fri, 21 Jul 2017 08:53:04 +0000 (UTC) archived-at: Fri, 21 Jul 2017 08:53:09 -0000 Github user CheneySun commented on the issue: https://github.com/apache/zookeeper/pull/312 @hanm The issue is raised while tens thousands of clients try to reconnect ZooKeeper service. Actually, we came across the issue during maintaining our HBase cluster, which used a 5-server ZooKeeper cluster. The HBase cluster was composed of many many regionservers (in thousand order of magnitude), and connected by tens thousands of clients to do massive reads/writes. Because the r/w throughput is very high, ZooKeeper zxid increased quickly as well. Basically, each two or three weeks, Zookeeper would make leader relection triggered by the zxid roll over. The leader relection will cause the clients(HBase regionservers and HBase clients) disconnected and reconnected with Zookeeper servers in the mean time, and try to renew the sessions. In current implementation of session renew, NIOServerCnxnFactory will clone all the connections at first in order to avoid race condition in multi-threads and go iterate the cloned connection set one by one to find the related session to renew. It's very time consuming. In our case (described above), it caused many region servers can't successfully renew session before session timeout, and eventually the HBase cluster lose these region servers and affect the HBase stability. The change is to make refactoring to the close session logic and introduce a ConcurrentHashMap to store session id and connection map relation, which is a thread-safe data structure and eliminate the necessary to clone the connection set at first. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastructure@apache.org or file a JIRA ticket with INFRA. ---