Date: Sun, 23 Jul 2017 14:50:03 +0000 (UTC)
From: "Hadoop QA (JIRA)"
To: dev@zookeeper.apache.org
Reply-To: dev@zookeeper.apache.org
Subject: [jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16097650#comment-16097650 ]

Hadoop QA commented on ZOOKEEPER-1669:
--------------------------------------

+1 overall.  GitHub Pull Request Build

    +1 @author.  The patch does not contain any @author tags.

    +0 tests included.  The patch appears to be a documentation patch that doesn't require tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/895//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/895//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/895//console

This message is automatically generated.

> Operations to server will be timed-out while thousands of sessions expired same time
> ------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1669
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1669
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 3.3.5
>            Reporter: tokoot
>            Assignee: Cheney Sun
>              Labels: performance
>
> If there are thousands of clients and most of them disconnect from the server at the same time (clients restarted, or servers partitioned from clients), the server becomes busy closing those "connections" and becomes unavailable. The problem is in the following code:
>   private void closeSessionWithoutWakeup(long sessionId) {
>       HashSet cnxns;
>       synchronized (this.cnxns) {
>           cnxns = (HashSet)this.cnxns.clone();  // other threads block here while the set is cloned
>       }
>       ...
>   }
> A real-world example that demonstrates this problem (kudos to [~sun.cheney]):
> {noformat}
> The issue arises when tens of thousands of clients try to reconnect to the ZooKeeper service.
> We came across the issue while maintaining our HBase cluster, which used a 5-server ZooKeeper cluster.
> The HBase cluster was composed of a great many regionservers (on the order of thousands),
> and tens of thousands of clients connected to it to do massive reads/writes.
> Because the r/w throughput was very high, the ZooKeeper zxid increased quickly as well.
> Roughly every two or three weeks, ZooKeeper would run a leader re-election triggered by zxid rollover.
> The leader re-election causes the clients (HBase regionservers and HBase clients) to disconnect from
> and reconnect to the ZooKeeper servers at the same time, and to try to renew their sessions.
> In the current implementation of session renewal, NIOServerCnxnFactory first clones all the connections
> to avoid race conditions between threads, then iterates over the cloned connection set one by one to
> find the connection for the session being renewed. This is very time-consuming. In our case (described above),
> many region servers could not renew their sessions before the session timeout,
> and eventually the HBase cluster lost these region servers, which affected HBase stability.
> The change refactors the close-session logic and introduces a ConcurrentHashMap
> that stores the mapping from session id to connection. It is a thread-safe data structure
> and eliminates the need to clone the connection set first.
> {noformat}
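
The ConcurrentHashMap approach described in the quoted example might look roughly like the sketch below. This is an illustration only, not the actual patch: Cnxn and SessionRegistry are hypothetical stand-ins invented here (they are not ZooKeeper classes), and the sketch assumes at most one connection per session id.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a server-side connection; not ZooKeeper's NIOServerCnxn. */
final class Cnxn {
    final long sessionId;
    private volatile boolean closed;

    Cnxn(long sessionId) {
        this.sessionId = sessionId;
    }

    void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

/** Hypothetical registry: looks a connection up by session id instead of cloning and scanning a set. */
final class SessionRegistry {
    // Assumption for this sketch: one connection per session id.
    private final ConcurrentHashMap<Long, Cnxn> sessionMap = new ConcurrentHashMap<>();

    void register(Cnxn cnxn) {
        sessionMap.put(cnxn.sessionId, cnxn);
    }

    /** Closes the connection for the given session, if present; returns true if one was found. */
    boolean closeSession(long sessionId) {
        // remove() is atomic, so no global lock or clone of the whole collection is needed.
        Cnxn cnxn = sessionMap.remove(sessionId);
        if (cnxn == null) {
            return false;
        }
        cnxn.close();
        return true;
    }

    int size() {
        return sessionMap.size();
    }
}
```

With a structure like this, closing one session is a single map lookup rather than a clone of every connection followed by a linear scan, so the cost per closed session no longer grows with the total number of connections.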