zookeeper-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
Date Fri, 18 Aug 2017 05:36:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131747#comment-16131747 ]

Hudson commented on ZOOKEEPER-2872:
-----------------------------------

SUCCESS: Integrated in Jenkins build ZooKeeper-trunk #3503 (See [https://builds.apache.org/job/ZooKeeper-trunk/3503/])
ZOOKEEPER-2872: Interrupted snapshot sync causes data loss (hanm: rev 0706b40afad079f19fe9f76c99bbb7ec69780dbd)
* (edit) src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java
* (edit) src/java/test/org/apache/zookeeper/test/TruncateTest.java
* (edit) src/java/main/org/apache/zookeeper/server/quorum/Learner.java
* (edit) src/java/main/org/apache/zookeeper/server/persistence/SnapShot.java
* (edit) src/java/main/org/apache/zookeeper/server/persistence/FileSnap.java
* (edit) src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java
* (edit) src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java


> Interrupted snapshot sync causes data loss
> ------------------------------------------
>
>                 Key: ZOOKEEPER-2872
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2872
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Brian Nixon
>
> There is a way for observers to permanently lose data from their local data tree while
> remaining members in good standing with the ensemble and continuing to serve client traffic
> when the following chain of events occurs.
> 1. The observer dies in epoch N from machine failure.
> 2. The observer comes back up in epoch N+1 and requests a snapshot sync to catch up.
> 3. The machine powers off before the snapshot is synced to disk and after some txns
>    have been logged (depending on the OS, this can happen!).
> 4. The observer comes back a second time and replays its most recent snapshot (epoch
>    <= N) as well as the txn logs (epoch N+1).
> 5. A diff sync is requested from the leader and the observer broadcasts availability.
> In this scenario, any commits from epoch N that the observer did not receive before it
> died the first time will never be exposed to the observer and no part of the ensemble will
> complain.
> This situation is not unique to observers and can happen to any learner. As a simple
> fix, fsync-ing the snapshots received from the leader will avoid the case of missing snapshots
> causing data loss.
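
To illustrate the "fsync the snapshot" idea from the description above, here is a
minimal Java sketch. It is not the code from the commit: the class name
DurableSnapshotSketch, the method writeSnapshot, and the fsync flag are all
hypothetical names chosen for this example. It only shows the general technique of
forcing file contents to stable storage with FileChannel.force(true) before the
write is treated as complete.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;

// Hypothetical illustration only; this is not ZooKeeper's FileSnap or FileTxnSnapLog.
final class DurableSnapshotSketch {

    // Writes the serialized snapshot bytes to the given file.
    // When fsync is true, the data and file metadata are forced to the storage
    // device before the stream is closed, so a power loss right after this
    // method returns should not leave the snapshot only in the OS page cache.
    static void writeSnapshot(File file, byte[] data, boolean fsync) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file)) {
            out.write(data);
            out.flush();
            if (fsync) {
                // force(true) asks the OS to flush both content and metadata.
                out.getChannel().force(true);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File snapshot = File.createTempFile("snapshot", ".bin");
        byte[] state = "serialized data tree".getBytes(StandardCharsets.UTF_8);

        // A snapshot received from the leader during a snapshot sync would be
        // written with fsync enabled in this sketch.
        writeSnapshot(snapshot, state, true);

        byte[] readBack = Files.readAllBytes(snapshot.toPath());
        System.out.println("round trip ok: " + Arrays.equals(state, readBack));
        snapshot.delete();
    }
}
```

In this sketch the flag stays optional because forcing every snapshot to disk has a
latency cost; the scenario above only needs it for snapshots received from the leader
during a sync.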



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
