Date: Fri, 10 Feb 2017 15:05:42 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: dev@zookeeper.apache.org
Reply-To: dev@zookeeper.apache.org
Subject: [jira] [Commented] (ZOOKEEPER-2678) Large databases take a long time to regain a quorum

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861379#comment-15861379 ]

ASF GitHub Bot commented on ZOOKEEPER-2678:
-------------------------------------------

Github user revans2 commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/159#discussion_r100552192

    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/Learner.java ---
    @@ -498,14 +504,19 @@ else if (qp.getType() == Leader.SNAP) {
                                 throw new Exception("changes proposed in reconfig");
                             }
                         }
    -                    if (!snapshotTaken) { // true for the pre v1.0 case
    -                        zk.takeSnapshot();
    +                    if (isPreZAB1_0) {
    +                        zk.takeSnapshot();
                             self.setCurrentEpoch(newEpoch);
                         }
                         self.setZooKeeperServer(zk);
                         self.adminServer.setZooKeeperServer(zk);
                         break outerLoop;
    -                case Leader.NEWLEADER: // it will be NEWLEADER in v1.0
    +                case Leader.NEWLEADER: // Getting NEWLEADER here instead of in discovery
    +                    // means this is Zab 1.0
    +                    // Create updatingEpoch file and remove it after current
    --- End diff --

    You are right, will fix that.

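The guard changed in the diff above can be illustrated with a minimal sketch. Everything named here (`SyncSnapshotSketch`, `FakeServer`, `finishSync`) is a hypothetical stand-in, not ZooKeeper's real `Learner`, `zk`, or `self` objects; the sketch only shows the shape of the condition, under the assumption that the snapshot and the epoch update at this point in the sync run only when the leader speaks the pre-ZAB 1.0 protocol.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the guarded block; not ZooKeeper code. */
public class SyncSnapshotSketch {

    /** Stand-in server that records which actions were requested. */
    static final class FakeServer {
        final List<String> actions = new ArrayList<>();
        long currentEpoch = 0L;

        void takeSnapshot() {
            actions.add("takeSnapshot");
        }

        void setCurrentEpoch(long epoch) {
            currentEpoch = epoch;
            actions.add("setCurrentEpoch");
        }
    }

    /**
     * Mirrors the shape of the guarded block in the diff: the snapshot and
     * the epoch update happen only in the pre-ZAB 1.0 case.
     */
    static void finishSync(FakeServer server, boolean isPreZab10, long newEpoch) {
        if (isPreZab10) {
            server.takeSnapshot();
            server.setCurrentEpoch(newEpoch);
        }
    }

    public static void main(String[] args) {
        FakeServer legacy = new FakeServer();
        finishSync(legacy, true, 7L);
        System.out.println("pre-ZAB 1.0: " + legacy.actions);

        FakeServer modern = new FakeServer();
        finishSync(modern, false, 7L);
        System.out.println("ZAB 1.0: " + modern.actions);
    }
}
```

With a multi-gigabyte database, skipping a snapshot write on this path is where the time saving would come from; whether the real patch skips it in every ZAB 1.0 case is not something this sketch establishes.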
> Large databases take a long time to regain a quorum
> ---------------------------------------------------
>
>                 Key: ZOOKEEPER-2678
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2678
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.9, 3.5.2
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>
> I know this is long, but please hear me out.
> I recently inherited a massive ZooKeeper ensemble. The snapshot is 3.4 GB on disk. Because of its size we have been running into a number of issues. There are lots of problems that we hope to fix with GC tuning and similar changes, but the big one right now, which is blocking progress on the rest, is that when we lose a quorum because the leader left, for whatever reason, it can take well over 5 minutes for a new quorum to be established. So we cannot tune the leader without risking downtime.
> We tracked down where the time was being spent and found that each server was clearing the database, so that it would be read back in again, before leader election even started. Then, as part of the sync phase, each server writes out a snapshot to checkpoint the progress it made during the sync.
> I will be putting up a patch shortly with some proposed changes in it.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)