From dev-return-67633-archive-asf-public=cust-asf.ponee.io@zookeeper.apache.org Fri Feb 16 21:06:06 2018
From: "ASF GitHub Bot (JIRA)"
To: dev@zookeeper.apache.org
Date: Fri, 16 Feb 2018 20:06:00 +0000 (UTC)
Subject: [jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16367814#comment-16367814 ]

ASF GitHub Bot commented on ZOOKEEPER-2845:
-------------------------------------------

Github user afine commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/453#discussion_r168857757

    --- Diff: src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java ---
    @@ -888,4 +888,127 @@ public void testWithOnlyMinSessionTimeout() throws Exception {
                         maxSessionTimeOut, quorumPeer.getMaxSessionTimeout());
         }

    +    @Test
    +    public void testTxnAheadSnapInRetainDB() throws Exception {
    +        // 1. start up servers and wait for leader election to finish
    +        ClientBase.setupTestEnv();
    +        final int SERVER_COUNT = 3;
    +        final int clientPorts[] = new int[SERVER_COUNT];
    +        StringBuilder sb = new StringBuilder();
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            clientPorts[i] = PortAssignment.unique();
    +            sb.append("server." + i + "=127.0.0.1:" + PortAssignment.unique()
    +                    + ":" + PortAssignment.unique() + ";" + clientPorts[i] + "\n");
    +        }
    +        String quorumCfgSection = sb.toString();
    +
    +        MainThread mt[] = new MainThread[SERVER_COUNT];
    +        ZooKeeper zk[] = new ZooKeeper[SERVER_COUNT];
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            mt[i] = new MainThread(i, clientPorts[i], quorumCfgSection);
    +            mt[i].start();
    +            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
    +        }
    +
    +        waitForAll(zk, States.CONNECTED);
    +
    +        // We need to shut down and start back up to make sure that the create
    +        // session isn't the first transaction, since that is rather innocuous.
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            mt[i].shutdown();
    +        }
    +
    +        waitForAll(zk, States.CONNECTING);
    +
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            mt[i].start();
    +            // Recreate a client session since the previous session was not persisted.
    +            zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
    +        }
    +
    +        waitForAll(zk, States.CONNECTED);
    +
    +        // 2. kill all followers
    +        int leader = -1;
    +        Map outstanding = null;
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (mt[i].main.quorumPeer.leader != null) {
    +                leader = i;
    +                outstanding = mt[leader].main.quorumPeer.leader.outstandingProposals;
    +                // increase the tick time to delay the leader going to looking
    +                mt[leader].main.quorumPeer.tickTime = 10000;
    +            }
    +        }
    +        LOG.warn("LEADER {}", leader);
    +
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (i != leader) {
    +                mt[i].shutdown();
    +            }
    +        }
    +
    +        // 3. start up the followers to form a new quorum
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (i != leader) {
    +                mt[i].start();
    +            }
    +        }
    +
    +        // 4. wait for one of the followers to become the leader
    +        for (int i = 0; i < SERVER_COUNT; i++) {
    +            if (i != leader) {
    +                // Recreate a client session since the previous session was not persisted.
    +                zk[i] = new ZooKeeper("127.0.0.1:" + clientPorts[i], ClientBase.CONNECTION_TIMEOUT, this);
    +                waitForOne(zk[i], States.CONNECTED);
    +            }
    +        }
    +
    +        // 5. send a create request to the old leader and make sure it's synced
    +        // to disk, which means it was acked by itself
    +        try {
    +            zk[leader].create("/zk" + leader, "zk".getBytes(), Ids.OPEN_ACL_UNSAFE,
    +                    CreateMode.PERSISTENT);
    +            Assert.fail("create /zk" + leader + " should have failed");
    +        } catch (KeeperException e) {
    +        }
    +
    +        // just make sure that we actually did get it in process at the leader
    +        Assert.assertTrue(outstanding.size() == 1);
    +        Proposal p = (Proposal) outstanding.values().iterator().next();
    +        Assert.assertTrue(p.request.getHdr().getType() == OpCode.create);
    +
    +        // make sure it has a chance to write it to disk
    +        Thread.sleep(1000);
    --- End diff --

    @revans2 take a look at `testElectionFraud`, specifically:

    https://github.com/apache/zookeeper/blob/master/src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java#L383

    and

    https://github.com/apache/zookeeper/blob/master/src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java#L397

    You can manually start and stop leader election; I think that may solve this problem.
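For context on the flakiness concern: `Thread.sleep(1000)` only gives the leader's SyncRequestProcessor a chance to sync txn T1 to disk; it guarantees nothing. If manually driving leader election proves difficult, one alternative is a condition-based wait. A minimal sketch, assuming a hypothetical `waitFor` helper that is not part of the ZooKeeper test base:

    import java.util.function.BooleanSupplier;

    import org.junit.Assert;

    // Hypothetical helper: poll a condition until it holds, or fail after a
    // timeout, instead of sleeping once for a fixed interval and hoping.
    final class WaitUtil {
        static void waitFor(String failureMessage, BooleanSupplier condition, long timeoutMs)
                throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (!condition.getAsBoolean()) {
                if (System.currentTimeMillis() > deadline) {
                    Assert.fail(failureMessage);
                }
                Thread.sleep(50); // re-check periodically rather than sleeping once
            }
        }
    }

The hard part is the predicate: the test can already poll `outstandingProposals` for the proposal itself, but a "txn synced to disk" signal would still need to be exposed somehow, which is why manually starting and stopping leader election (as in `testElectionFraud`) may be the cleaner approach.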
> Data inconsistency issue due to retain database in leader election
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2845
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Fangmin Lv
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time during leader election. In a ZooKeeper ensemble, it's possible that the snapshot is ahead of the txn file (due to a slow disk on the server, etc.), or that the txn file is ahead of the snapshot because no commit message has been received yet.
> If the snapshot is ahead of the txn file, this is not an issue: the SyncRequestProcessor queue is drained during shutdown, so the snapshot and txn file stay consistent before leader election happens.
> But if the txn file is ahead of the snapshot, the ensemble can end up with inconsistent data. Here is a simplified scenario that shows the issue.
> Say we have 3 servers in the ensemble, A and B are followers, C is the leader, and all snapshots and txns are up to T0:
> 1. A new request reaches leader C to create node N, and it is converted to txn T1.
> 2. Txn T1 is synced to disk on C, but A and B restart just before the proposal reaches the followers, so T1 never exists on A or B.
> 3. A and B form a new quorum after the restart; say B is the new leader.
> 4. C moves to the LOOKING state because it no longer has enough followers, and it syncs with leader B with last zxid T0, which produces an empty diff sync.
> 5. C restarts before taking a snapshot, and on startup it replays the txns on disk, which include T1. Now C has node N, but A and B don't.
> I also included a test case that reproduces this issue consistently.
> We have a completely different RetainDB version that avoids this issue by reconciling the snapshot and txn files before leader election; we will submit it for review.
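For illustration only (a sketch of the idea, not the patch mentioned above): reconciling the two on-disk artifacts starts with comparing the highest zxid in the txn log against the zxid of the most recent snapshot before the server votes. Using ZooKeeper's persistence classes, that comparison could look like the following; the class name and the decision about what to do with the uncommitted tail are assumptions:

    import java.io.File;
    import java.io.IOException;

    import org.apache.zookeeper.server.persistence.FileTxnSnapLog;
    import org.apache.zookeeper.server.persistence.Util;

    // Hypothetical pre-election check: is the txn log ahead of the newest snapshot?
    public class SnapTxnGapCheck {
        public static boolean txnLogAheadOfSnapshot(File dataDir, File snapDir) throws IOException {
            FileTxnSnapLog snapLog = new FileTxnSnapLog(dataDir, snapDir);
            // Highest zxid recorded in the transaction log files.
            long lastLoggedZxid = snapLog.getLastLoggedZxid();
            // Zxid encoded in the file name of the most recent snapshot (0 if none exists).
            File snapshot = snapLog.findMostRecentSnapshot();
            long snapZxid = (snapshot == null) ? 0L : Util.getZxidFromName(snapshot.getName(), "snapshot");
            return lastLoggedZxid > snapZxid;
        }
    }

Note that the txn log being ahead of the snapshot is normal for committed txns; the scenario above is dangerous specifically because T1 was never committed by a quorum, so a check like this can only be the starting point for the reconciliation described in the issue.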