Mailing-List: contact dev-help@zookeeper.apache.org; run by ezmlm
Reply-To: dev@zookeeper.apache.org
Date: Tue, 3 May 2016 06:15:13 +0000 (UTC)
From: "Yong Zhang (JIRA)"
To: dev@zookeeper.apache.org
Subject: [jira] [Commented] (ZOOKEEPER-2310) Snapshot files must be synced to prevent inconsistency or data loss
archived-at: Tue, 03 May 2016 06:15:16 -0000

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268170#comment-15268170 ]

Yong Zhang commented on ZOOKEEPER-2310:
---------------------------------------

Hi [~abhishekrai], this patch does not
work on Windows, because FileChannel.open on a folder does not seem to be supported on Windows.
We do not always run the zk server on Windows, but we have a lot of code there, such as the HBase UTs.

> Snapshot files must be synced to prevent inconsistency or data loss
> -------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2310
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2310
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6
>            Reporter: Abhishek Rai
>            Assignee: Abhishek Rai
>         Attachments: zookeeper-2310.patch
>
>
> Today, Zookeeper server syncs transaction log files to disk by default, but does not sync snapshot files. Consequently, an untimely crash may result in a lost or incomplete snapshot file. During recovery, if the server finds a valid older snapshot file, it will load it and replay subsequent log(s), skipping the incomplete snapshot file. It's possible that the skipped file had some transactions which are not present in the replayed transaction logs. Since quorum synchronization is based on the last transaction ID of each server, this will never get noticed, resulting in inconsistency between servers and possible data loss.
> The following sequence of events describes a sample scenario where this can happen:
> # Server F is a follower in a Zookeeper ensemble.
> # F's most recent valid snapshot file is named "snapshot.10", containing state up to zxid = 10. F is currently writing to the transaction log file "log.11", with the most recent zxid = 20.
> # Fresh round of election.
> # F receives a few new transactions 21 to 30 from new leader L as the "diff". Current server behavior is to dump current state plus diff to a new snapshot file, "snapshot.30".
> # F finalizes the snapshot file, but file contents are still buffered in OS caches. Zookeeper does not sync snapshot file contents to disk.
> # F receives a new transaction 31 from the leader, which it appends to the existing transaction log file, "log.11", and syncs the file to disk.
> # Server machine crashes or is cold rebooted.
> # After recovery, snapshot file "snapshot.30" may not exist or may be empty. See below for why that may happen.
> # In either case, F looks for the last finalized snapshot file, finds and loads "snapshot.10". It then replays transactions from "log.11". Ultimately, its last seen zxid will be 31, but it would not have replayed transactions 21 to 30 received via the "diff" from the leader.
> # Clients which are connected to F may see different data than clients connected to other members of the ensemble, violating the single system image invariant. Also, if F were to become a leader at some point, it could use its state to seed other servers, and they all could lose the writes in the missing interval above.
> *Notes:*
> - Reason why the snapshot file may be missing or incomplete:
> -- Zookeeper does not sync the data directory after creating a snapshot file. Even if a newly created file is synced to disk, if the corresponding directory entry is not, then the file will not be visible in the namespace.
> -- Zookeeper does not sync snapshot files. So, they may be empty or incomplete during recovery from an untimely crash.
> - In step (6) above, the server could also have written the new transaction 31 to a new log file, "log.31". The final outcome would still be the same.
> We are able to deterministically reproduce this problem using the following steps:
> # Create a new Zookeeper ensemble on 3 hosts: A, B, and C.
> # Ensure each server has at least one snapshot file in its data dir.
> # Stop Zookeeper process on server A.
> # Slow down disk syncs on server A (see example script below). This ensures that snapshot files written by Zookeeper don't make it to disk spontaneously.
>   Log files will be written to disk as Zookeeper explicitly issues a sync call on such files.
> # Connect to server B and create a new znode /test1.
> # Start Zookeeper process on A, wait for it to write a new snapshot to its datadir. This snapshot would contain /test1 but it won’t be synced to disk yet.
> # Connect to A and verify that /test1 is visible.
> # Connect to B and create another znode /test2. This will cause A’s transaction log to grow further to receive /test2.
> # Cold reboot A.
> # A’s last snapshot is a zero-sized file or is missing altogether since it did not get synced to disk before reboot. We have seen both in different runs.
> # Connect to A and verify that /test1 does not exist. It exists on B and C.
> Slowing down disk syncs:
> {noformat}
> echo 360000 | sudo tee /proc/sys/vm/dirty_writeback_centisecs
> echo 360000 | sudo tee /proc/sys/vm/dirty_expire_centisecs
> echo 99 | sudo tee /proc/sys/vm/dirty_background_ratio
> echo 99 | sudo tee /proc/sys/vm/dirty_ratio
> {noformat}
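
For illustration only, here is a minimal Java sketch of the two syncs the issue description asks for (the snapshot file's contents, then the data directory entry), with a guard for the Windows behavior reported in the comment. It is not the contents of zookeeper-2310.patch. The class name SnapshotSyncSketch, the method names writeAndSync and trySyncDirectory, and the choice to treat a failed directory sync as a best-effort no-op are all assumptions made for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of ZooKeeper.
public class SnapshotSyncSketch {

    /** Writes data to the file and forces contents and metadata to stable storage. */
    public static void writeAndSync(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // Without this call the bytes may sit in OS caches and be lost on a crash.
            ch.force(true);
        }
    }

    /**
     * Best-effort sync of a directory so that a newly created entry survives a crash.
     * Returns false when the platform refuses to open or sync a directory this way,
     * which the comment above reports for Windows. Swallowing the error here is an
     * assumption of this sketch, not a recommendation for the real fix.
     */
    public static boolean trySyncDirectory(Path dir) {
        try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
            ch.force(true);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dataDir = Files.createTempDirectory("snapshot-sync-sketch");
        Path snapshot = dataDir.resolve("snapshot.30");

        writeAndSync(snapshot, "example state".getBytes(StandardCharsets.UTF_8));
        boolean dirSynced = trySyncDirectory(dataDir);

        System.out.println("snapshot size in bytes: " + Files.size(snapshot));
        System.out.println("directory sync succeeded: " + dirSynced);
    }
}
```

Whether a failed directory sync should be ignored, logged, or treated as fatal on platforms that do not support it is a design decision for the actual patch. The sketch only shows where such a check could sit.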