Mailing-List: contact dev-help@zookeeper.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@zookeeper.apache.org
Date: Tue, 3 May 2016 06:15:13 +0000 (UTC)
From: "Yong Zhang (JIRA)" <jira@apache.org>
To: dev@zookeeper.apache.org
Message-ID: <JIRA.12909478.1446359193000.94412.1462256113015@Atlassian.JIRA>
In-Reply-To: <JIRA.12909478.1446359193000@Atlassian.JIRA>
References: <JIRA.12909478.1446359193000@Atlassian.JIRA> <JIRA.12909478.1446359193128@arcas>
Subject: [jira] [Commented] (ZOOKEEPER-2310) Snapshot files must be synced
 to prevent inconsistency or data loss
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Tue, 03 May 2016 06:15:16 -0000


    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268170#comment-15268170 ]

Yong Zhang commented on ZOOKEEPER-2310:
---------------------------------------

Hi [~abhishekrai], this patch does not work on Windows, because FileChannel.open on a directory does not seem to be supported there.
We do not always run the ZK server on Windows, but we do have a lot of code there, such as the HBase unit tests.
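
One possible way around this, sketched here under stated assumptions rather than taken from zookeeper-2310.patch, is to treat the directory sync as best effort: open the directory with FileChannel.open, call force(true), and tolerate the IOException on Windows instead of failing the snapshot. DirSync and syncDirectory are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of ZooKeeper or the attached patch. */
final class DirSync {
    private DirSync() {}

    /**
     * Best-effort fsync of a directory so that newly created entries become durable.
     *
     * @return true if the directory was synced, false if opening or syncing it
     *         failed on Windows and the failure was tolerated
     * @throws IOException if dir is not a directory, or if the sync fails on a
     *         platform where it is expected to work
     */
    static boolean syncDirectory(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            throw new IOException("not a directory: " + dir);
        }
        // Assumption: the os.name check is an adequate way to detect Windows here.
        boolean isWindows = System.getProperty("os.name", "")
                .toLowerCase(Locale.ROOT)
                .startsWith("windows");
        try (FileChannel channel = FileChannel.open(dir, StandardOpenOption.READ)) {
            channel.force(true); // flush directory metadata to the storage device
            return true;
        } catch (IOException e) {
            if (isWindows) {
                return false;    // directories cannot be opened this way; skip the sync
            }
            throw e;
        }
    }
}
```

The trade-off in this sketch is that Windows silently loses the directory-level durability guarantee, which may be acceptable for unit-test environments but is a judgment call for production servers.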

> Snapshot files must be synced to prevent inconsistency or data loss
> -------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2310
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2310
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6
>            Reporter: Abhishek Rai
>            Assignee: Abhishek Rai
>         Attachments: zookeeper-2310.patch
>
>
> Today, Zookeeper server syncs transaction log files to disk by default, but does not sync snapshot files.  Consequently, an untimely crash may result in a lost or incomplete snapshot file.  During recovery, if the server finds a valid older snapshot file, it will load it and replay subsequent log(s), skipping the incomplete snapshot file.  It's possible that the skipped file had some transactions which are not present in the replayed transaction logs.  Since quorum synchronization is based on last transaction ID of each server, this will never get noticed, resulting in inconsistency between servers and possible data loss.
> Following sequence of events describes a sample scenario where this can happen:
> # Server F is a follower in a Zookeeper ensemble.
> # F's most recent valid snapshot file is named "snapshot.10" containing state up to zxid = 10.  F is currently writing to the transaction log file "log.11", with the most recent zxid = 20.
> # Fresh round of election.
> # F receives a few new transactions 21 to 30 from new leader L as the "diff".  Current server behavior is to dump current state plus diff to a new snapshot file, "snapshot.30".
> # F finalizes the snapshot file, but file contents are still buffered in OS caches.  Zookeeper does not sync snapshot file contents to disk.
> # F receives a new transaction 31 from the leader, which it appends to the existing transaction log file, "log.11" and syncs the file to disk.
> # Server machine crashes or is cold rebooted.
> # After recovery, snapshot file "snapshot.30" may not exist or may be empty.  See below for why that may happen.
> # In either case, F looks for the last finalized snapshot file, finds and loads "snapshot.10".  It then replays transactions from "log.11".  Ultimately, its last seen zxid will be 31, but it would not have replayed transactions 21 to 30 received via the "diff" from the leader.
> # Clients which are connected to F may see different data than clients connected to other members of the ensemble, violating single system image invariant.  Also, if F were to become a leader at some point, it could use its state to seed other servers, and they all could lose the writes in the missing interval above.
> *Notes:*
> - Reason why snapshot file may be missing or incomplete:
> -- Zookeeper does not sync the data directory after creating a snapshot file.  Even if a newly created file is synced to disk, if the corresponding directory entry is not, then the file will not be visible in the namespace.
> -- Zookeeper does not sync snapshot files.  So, they may be empty or incomplete during recovery from an untimely crash.
> - In step (6) above, the server could also have written the new transaction 31 to a new log file, "log.31".  The final outcome would still be the same.
> We are able to deterministically reproduce this problem using the following steps:
> # Create a new Zookeeper ensemble on 3 hosts: A, B, and C.
> # Ensure each server has at least one snapshot file in its data dir.
> # Stop Zookeeper process on server A.
> # Slow down disk syncs on server A (see example script below). This ensures that snapshot files written by Zookeeper don't make it to disk spontaneously.  Log files will be written to disk as Zookeeper explicitly issues a sync call on such files.
> # Connect to server B and create a new znode /test1.
> # Start Zookeeper process on A, wait for it to write a new snapshot to its datadir.  This snapshot would contain /test1 but it won’t be synced to disk yet.
> # Connect to A and verify that /test1 is visible.
> # Connect to B and create another znode /test2.  This will cause A’s transaction log to grow further to receive /test2.
> # Cold reboot A.
> # A’s last snapshot is a zero-sized file or is missing altogether since it did not get synced to disk before reboot.  We have seen both in different runs.
> # Connect to A and verify that /test1 does not exist.  It exists on B and C.
> Slowing down disk syncs:
> {noformat}
> echo 360000 | sudo tee /proc/sys/vm/dirty_writeback_centisecs
> echo 360000 | sudo tee /proc/sys/vm/dirty_expire_centisecs
> echo 99 | sudo tee /proc/sys/vm/dirty_background_ratio
> echo 99 | sudo tee /proc/sys/vm/dirty_ratio
> {noformat}
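
The failure sequence in the quoted description can be illustrated with a small toy model in Java. This is only a sketch: RecoverySketch, recover and missingZxids are hypothetical names, a snapshot is reduced to its zxid plus a flag saying whether the file survived the crash, and the transaction log is reduced to a list of zxids. None of it is ZooKeeper's actual recovery code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of the scenario in the description; hypothetical, not ZooKeeper code. */
final class RecoverySketch {
    private RecoverySketch() {}

    /**
     * @param snapshots snapshot zxid mapped to true if the file survived the crash intact
     * @param logged    zxids present in the synced transaction log(s)
     * @return the zxids whose effects are contained in the recovered state
     */
    static TreeSet<Long> recover(TreeMap<Long, Boolean> snapshots, List<Long> logged) {
        long base = 0L; // zxid of the newest valid snapshot, 0 if there is none
        for (Map.Entry<Long, Boolean> e : snapshots.descendingMap().entrySet()) {
            if (e.getValue()) {
                base = e.getKey();
                break;
            }
        }
        TreeSet<Long> applied = new TreeSet<>();
        for (long z = 1; z <= base; z++) {
            applied.add(z);          // state captured by the loaded snapshot
        }
        for (long z : logged) {
            if (z > base) {
                applied.add(z);      // transactions replayed from the log
            }
        }
        return applied;
    }

    /** Returns the zxids below the last applied zxid that were never applied. */
    static List<Long> missingZxids(TreeSet<Long> applied) {
        List<Long> missing = new ArrayList<>();
        if (applied.isEmpty()) {
            return missing;
        }
        for (long z = 1; z < applied.last(); z++) {
            if (!applied.contains(z)) {
                missing.add(z);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        TreeMap<Long, Boolean> snapshots = new TreeMap<>();
        snapshots.put(10L, true);    // snapshot.10 is intact
        snapshots.put(30L, false);   // snapshot.30 is empty or missing after the crash
        List<Long> logged = new ArrayList<>();
        for (long z = 11; z <= 20; z++) {
            logged.add(z);           // log.11 before the election
        }
        logged.add(31L);             // transaction appended and synced after the diff
        TreeSet<Long> applied = recover(snapshots, logged);
        System.out.println("last zxid = " + applied.last());
        System.out.println("missing   = " + missingZxids(applied));
    }
}
```

With the inputs from the scenario, the model ends at zxid 31 while transactions 21 through 30 are absent, which is why a comparison based only on the last zxid cannot detect the gap.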

