From: "Michael K. Edwards (JIRA)"
To: dev@zookeeper.apache.org
Date: Thu, 22 Nov 2018 00:31:00 +0000 (UTC)
Subject: [jira] [Commented] (ZOOKEEPER-2846) Leader follower sync with on disk txns can possibly leads to data inconsistency

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695420#comment-16695420 ]

Michael K. Edwards commented on ZOOKEEPER-2846:
-----------------------------------------------

Does this need to be addressed (or release noted) for 3.5.5?

> Leader follower sync with on disk txns can possibly leads to data inconsistency
> -------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2846
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2846
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>   Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Fangmin Lv
>            Priority: Critical
>
> On-disk txn sync can cause data inconsistency if the current leader had a snap sync just before it became leader: a later diff sync with its followers may then propagate the txn gap that the snap sync left on the leader's disk. Here is the scenario:
> Let's say S0 - S3 are followers, and S4 is the leader at the beginning:
> 1. Stop S2 and send one more request
> 2. Stop S3 and send more requests to the quorum, so that S3 does a snap sync with S4 when it starts up
> 3. Stop S4; S3 becomes the new leader
> 4. Start S2, which does a diff sync with S3; now there are gaps in S2
> A test case that verifies the issue is attached. Currently, there is no efficient way to check whether a gap in the txn files is a real gap or is due to an epoch change. We need to add that support, but until then it would be safer to disable the on-disk txn leader-follower sync.
> Two more scenarios could cause the same issue:
> (Scenario 1) Servers A, B, and C; A is the leader, the others are followers:
> 1). A syncs a txn to disk, but the other two restart before receiving the proposal
> 2). B and C form a quorum with B as leader and commit some requests
> 3). A goes back to LOOKING and syncs with B; B is not able to trunc A and sends a snap instead, which leaves the extra txn in A's txn file
> 4). A becomes the new leader; any server that then does a diff sync with A will get the extra txn
> (Scenario 2) During a diff sync, committed txns are applied only to the data tree and not to the on-disk txn file, which also leaves a hole in the file and leads to data inconsistency when syncing with learners.
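
To illustrate why a gap in a txn file is hard to classify, here is a minimal sketch in Python. It is not ZooKeeper code: the helper names (make_zxid, split_zxid, classify_gaps) and the sample log are hypothetical. The only fact it relies on is that a zxid is a 64-bit value with the epoch in the high 32 bits and a counter in the low 32 bits. Within one epoch a skipped counter is a definite hole; across an epoch boundary the log alone does not say whether txns are missing, so the sketch can only mark that pair as ambiguous.

```python
# Illustrative sketch only; these helpers are hypothetical, not ZooKeeper APIs.
EPOCH_SHIFT = 32
COUNTER_MASK = 0xFFFFFFFF


def make_zxid(epoch, counter):
    """Pack an epoch (high 32 bits) and a counter (low 32 bits) into a zxid."""
    return (epoch << EPOCH_SHIFT) | counter


def split_zxid(zxid):
    """Return (epoch, counter) for a zxid."""
    return zxid >> EPOCH_SHIFT, zxid & COUNTER_MASK


def classify_gaps(zxids):
    """Scan an ascending list of zxids and report non-consecutive neighbours.

    Each finding is (prev_zxid, next_zxid, kind), where kind is:
      "hole"         - same epoch, counter skipped: txns are definitely missing
      "epoch-change" - epoch differs: ambiguous from the log alone
    """
    findings = []
    for prev, nxt in zip(zxids, zxids[1:]):
        prev_epoch, prev_counter = split_zxid(prev)
        next_epoch, next_counter = split_zxid(nxt)
        if next_epoch == prev_epoch:
            if next_counter != prev_counter + 1:
                findings.append((prev, nxt, "hole"))
        else:
            # The log does not record how many txns the old epoch really had,
            # so missing txns here look the same as a clean epoch change.
            findings.append((prev, nxt, "epoch-change"))
    return findings


if __name__ == "__main__":
    # Hypothetical log: counters 3 and 4 of epoch 1 are missing.
    log = [make_zxid(1, 1), make_zxid(1, 2), make_zxid(1, 5), make_zxid(2, 1)]
    for prev, nxt, kind in classify_gaps(log):
        print(split_zxid(prev), "->", split_zxid(nxt), kind)
```

A check like this catches holes inside an epoch cheaply, but it cannot resolve the "epoch-change" case, which is the missing support the issue description refers to.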