Date: Sun, 13 Aug 2017 05:20:01 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: dev@zookeeper.apache.org
Reply-To: dev@zookeeper.apache.org
Subject: [jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124803#comment-16124803 ]

ASF GitHub Bot commented on ZOOKEEPER-2872:
-------------------------------------------

Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/333#discussion_r132832594

    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/Learner.java ---
    @@ -364,6 +364,7 @@ protected void syncWithLeader(long newLeaderZxid) throws Exception{
                     readPacket(qp);
                     LinkedList packetsCommitted = new LinkedList();
                     LinkedList packetsNotCommitted = new LinkedList();
    +                boolean syncSnapshot = false;
    --- End diff --

    We can move this variable definition up so that it is grouped with the `snapshotNeeded` boolean. Another possibility is to get rid of this variable and reuse the existing `snapshotNeeded`, but that would also fsync the snapshot for a TRUNC sync, which the existing patch does not do.

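    To make the trade-off between the two options concrete, here is a simplified, self-contained model. It is not the actual Learner.java code: `SyncFlagSketch`, `SyncKind`, and both methods are hypothetical names, and the sketch assumes that the patch sets `syncSnapshot` only for a SNAP sync while `snapshotNeeded` starts out true and is cleared only for a DIFF sync.

```java
// Simplified model of the two flag choices discussed in the review comment.
// All names here are hypothetical; this is not the real Learner.java logic.
class SyncFlagSketch {
    // The three sync modes a leader can choose for a learner.
    enum SyncKind { DIFF, SNAP, TRUNC }

    // Option 1 (assumed behaviour of the patch): a dedicated flag that is
    // set only when a full snapshot was received from the leader.
    static boolean fsyncWithDedicatedFlag(SyncKind kind) {
        boolean syncSnapshot = false;
        if (kind == SyncKind.SNAP) {
            syncSnapshot = true;
        }
        return syncSnapshot;
    }

    // Option 2: reuse a flag that is true unless the sync was a DIFF
    // (assumed semantics of snapshotNeeded), so TRUNC also triggers an fsync.
    static boolean fsyncWithReusedFlag(SyncKind kind) {
        boolean snapshotNeeded = true;
        if (kind == SyncKind.DIFF) {
            snapshotNeeded = false;
        }
        return snapshotNeeded;
    }

    public static void main(String[] args) {
        for (SyncKind kind : SyncKind.values()) {
            System.out.println(kind
                    + ": dedicated=" + fsyncWithDedicatedFlag(kind)
                    + ", reused=" + fsyncWithReusedFlag(kind));
        }
    }
}
```

    Under these assumptions the two options differ only for TRUNC, which is the extra fsync mentioned above.
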
> Interrupted snapshot sync causes data loss
> ------------------------------------------
>
>                 Key: ZOOKEEPER-2872
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2872
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Brian Nixon
>
> There is a way for observers to permanently lose data from their local data tree while remaining members in good standing of the ensemble and continuing to serve client traffic. It happens when the following chain of events occurs.
> 1. The observer dies in epoch N from machine failure.
> 2. The observer comes back up in epoch N+1 and requests a snapshot sync to catch up.
> 3. The machine powers off before the snapshot is synced to disk and after some txns have been logged (depending on the OS, this can happen!).
> 4. The observer comes back a second time and replays its most recent snapshot (epoch <= N) as well as the txn logs (epoch N+1).
> 5. A diff sync is requested from the leader and the observer broadcasts availability.
> In this scenario, any commits from epoch N that the observer did not receive before it died the first time will never be exposed to the observer, and no part of the ensemble will complain.
> This situation is not unique to observers and can happen to any learner. As a simple fix, fsync-ing the snapshots received from the leader avoids the case where a missing snapshot causes data loss.
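
For readers unfamiliar with what "fsync-ing the snapshot" means in Java terms, the following is a minimal sketch of writing a file and forcing it to the storage device before returning. It is not ZooKeeper's snapshot code: `SnapshotFsyncSketch` and `writeDurably` are hypothetical names, and the byte array stands in for a serialized snapshot. Only standard JDK calls (`FileOutputStream.getChannel()` and `FileChannel.force(boolean)`) are used.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
class SnapshotFsyncSketch {
    // Writes the data and asks the OS to flush both content and metadata
    // to the storage device before the stream is closed.
    static void writeDurably(Path file, byte[] data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file.toFile())) {
            out.write(data);
            out.flush();
            // force(true) requests that file content and metadata be written
            // to the device, which is the Java counterpart of fsync.
            out.getChannel().force(true);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("snapshot-sketch", ".bin");
        try {
            writeDurably(tmp, new byte[] {1, 2, 3});
            System.out.println("bytes on disk: " + Files.size(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Without the `force` call, a power loss as in step 3 can leave the written bytes only in the OS page cache, so the snapshot may be missing or incomplete after the restart even though the write itself returned successfully.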