Date: Wed, 12 Apr 2017 05:00:48 +0000 (UTC)
From: "Abhay Bothra (JIRA)"
To: dev@zookeeper.apache.org
Reply-To: dev@zookeeper.apache.org
Subject: [jira] [Updated] (ZOOKEEPER-2745) Node loses data after disk-full event, but successfully joins Quorum
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abhay Bothra updated ZOOKEEPER-2745:
------------------------------------
Description:

If the disk is full on one ZooKeeper node in a 3-node ensemble, that node is still able to join the quorum with partial data.

Setup:
--------
- Running a 3-node ZooKeeper ensemble on Ubuntu 12.04 as upstart services. Let's call the nodes A, B and C.
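For context, a minimal zoo.cfg for an ensemble like the one described here might look roughly as follows. The hostnames, ports and paths are made up; the only value taken from this report is autopurge.snapRetainCount=3, referenced in the RCA below.

{code}
# zoo.cfg (illustrative; hostnames and paths are hypothetical)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181

# the three ensemble members referred to as A, B and C
server.1=nodeA:2888:3888
server.2=nodeB:2888:3888
server.3=nodeC:2888:3888

# keep the 3 most recent snapshots; purge the rest once an hour
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
{code}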
Observation:
-----------------
- Connecting to 2 of the 3 nodes (A and B) and doing an `ls` on the ZooKeeper data tree was giving:
  /foo
  /bar
  /baz
  But an `ls` on node C was giving:
  /baz
- On node C, the ZooKeeper data directory had the following files:
  log.1001
  log.1600
  snapshot.1000 -> size 200
  snapshot.1200 -> size 269
  snapshot.1300 -> size 300
- Snapshot sizes on nodes A and B were in the vicinity of 500KB.

RCA
-------
- The disk was full on node C prior to the creation time of the small snapshot files.
- Looking at the ZooKeeper server logs, we observed that ZooKeeper had crashed and restarted a few times after the first instance of disk full. Every time ZooKeeper starts, it does 3 things:
  1. Run the purge task to clean up old snapshots and txn logs. Our autopurge.snapRetainCount is set to 3.
  2. Restore from the most recent valid snapshot and the txn logs that follow it.
  3. Take part in a leader election, realize it has missed something, become a follower, get a diff of the missed txns from the current leader, and write a new snapshot of its current state.
- We confirmed that a valid snapshot of the system had existed prior to, and immediately after, the crash. Let's call this snapshot snapshot.800.
- Over the next 3 restarts, ZooKeeper did the following:
  - Purged older snapshots.
  - Restored from snapshot.800 + txn logs.
  - Synced up with the leader, then tried to write its updated state to a new snapshot and crashed due to the full disk. The snapshot file, even though invalid, had been created.
  - *Note*: This is the first source of the bug. It would be more appropriate to first write the snapshot to a temporary file, and only then rename it to its final snapshot.<zxid> name. That would give us more confidence in the validity of the snapshots in the data dir (see the first sketch after this description).
- Let's say the snapshot files created above were snapshot.850, snapshot.920 and snapshot.950.
- On the 4th restart, the purge task retained the 3 most recent snapshots - snapshot.850, snapshot.920, and snapshot.950 - and proceeded to purge snapshot.800 and the associated txn logs, assuming they were no longer needed.
  - *Note*: This is the second source of the bug. Instead of retaining the 3 most recent *valid* snapshots, the server just retains the 3 most recent snapshots, regardless of their validity (see the second sketch after this description).
- When restoring, ZooKeeper doesn't find any valid snapshot to restore from, so it tries to reload its state from the txn logs starting at zxid 0. However, those transaction logs were garbage-collected long ago, so it reloads from whatever txn logs are present. Let's say the only txn log file present (log.951) contains txns for zxids 951 to 998. It reloads from that log file, syncs with the leader, gets txns 999 and 1000, and writes snapshot.1000 to disk. Now that snapshot.800 has been deleted, there is enough free disk space to write snapshot.1000. From this state onwards, ZooKeeper will always assume it has the full state up to txn id 1000, even though it only has state from txn id 951 to 1000.
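The first note above suggests writing snapshots via a temporary file. Below is a minimal sketch of that write-then-rename pattern; it is hypothetical code, not ZooKeeper's actual snapshot writer, and the class and method names are made up. A crash while writing (for example on a full disk) leaves behind only a *.tmp file, so a half-written file never carries a valid-looking snapshot.<zxid> name.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public final class AtomicSnapshotWriter {

    public static void writeSnapshot(Path dataDir, long zxid, byte[] serializedState)
            throws IOException {
        Path tmp = dataDir.resolve("snapshot." + Long.toHexString(zxid) + ".tmp");
        Path dst = dataDir.resolve("snapshot." + Long.toHexString(zxid));

        // Write and fsync the temporary file first; a disk-full failure here
        // leaves only a *.tmp file behind, never a truncated snapshot.<zxid>.
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(serializedState);
            while (buf.hasRemaining()) {
                ch.write(buf);          // throws if the disk fills up
            }
            ch.force(true);             // flush data to stable storage
        }

        // Publish the snapshot atomically: readers see either no file or a complete one.
        Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
    }
}
{code}

The second note proposes a validity-aware retention rule. The sketch below is again hypothetical (not the actual PurgeTxnLog code): keep the N most recent snapshots that pass a validity check, so a run of truncated snapshots produced under a full disk cannot push the last good snapshot and its txn logs out of the retained set. The isValidSnapshot predicate stands in for whatever check the server would use (for example, that the snapshot header and trailer parse cleanly).

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

public final class ValidSnapshotRetention {

    /** Returns the snapshots to keep; everything older than these may be purged. */
    public static List<File> snapshotsToRetain(List<File> snapshots,
                                               int retainCount,
                                               Predicate<File> isValidSnapshot) {
        List<File> sorted = new ArrayList<>(snapshots);
        // Newest first, assuming file names of the form snapshot.<zxid in hex>.
        sorted.sort(Comparator.comparingLong((File f) -> zxidOf(f)).reversed());

        List<File> keep = new ArrayList<>();
        for (File snap : sorted) {
            if (keep.size() == retainCount) {
                break;
            }
            // Truncated or invalid snapshots are skipped: they neither count toward
            // the retain limit nor displace an older but valid snapshot.
            if (isValidSnapshot.test(snap)) {
                keep.add(snap);
            }
        }
        return keep;
    }

    private static long zxidOf(File snap) {
        String name = snap.getName();
        return Long.parseLong(name.substring(name.lastIndexOf('.') + 1), 16);
    }
}
{code}

With this rule, the fourth restart in the scenario above would still count snapshot.800 among the 3 retained snapshots, because snapshot.850, snapshot.920 and snapshot.950 fail the validity check.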
was:
(The previous description was identical to the updated one above, except that it ended with this additional note:)

- *Note*: This is the third source of the bug. On restore, if snapshot.X is the latest valid snapshot file, the ZooKeeper server should validate that it can find the txn with zxid X+1 in one of the txn logs. Otherwise, it should not use any of the txn logs during restore (see the sketch below).
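A sketch of the restore-time continuity check proposed in that note; this is hypothetical code, not the server's actual FileTxnSnapLog restore path. The idea: given zxid X of the latest valid snapshot and the zxid range each txn log file actually covers (the end of a range has to be found by scanning the file), refuse to replay the logs unless some log really contains zxid X+1.

{code:java}
import java.util.List;

public final class RestoreContinuityCheck {

    /** Inclusive range of zxids actually present in one txn log file. */
    public static final class LogRange {
        final long firstZxid;
        final long lastZxid;

        public LogRange(long firstZxid, long lastZxid) {
            this.firstZxid = firstZxid;
            this.lastZxid = lastZxid;
        }
    }

    /**
     * @param snapshotZxid zxid X of the most recent valid snapshot (0 if none exists)
     * @param logs         zxid ranges covered by the txn log files in the data dir
     * @return true if the logs can be replayed on top of the snapshot without a gap
     */
    public static boolean logsContinueSnapshot(long snapshotZxid, List<LogRange> logs) {
        long needed = snapshotZxid + 1;
        for (LogRange log : logs) {
            if (log.firstZxid <= needed && needed <= log.lastZxid) {
                return true; // the txn right after the snapshot is really on disk
            }
        }
        // No log contains zxid X+1: replaying whatever logs are present would
        // leave a hole in the state, so the logs should be ignored and the
        // server should resync its full state from the leader instead.
        return false;
    }
}
{code}

In the scenario described here, with no valid snapshot left and only log.951 (zxids 951 to 998) on disk, the check fails for zxid 1, so the server would fall back to a full resync from the leader instead of silently reloading a partial data tree.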
> Node loses data after disk-full event, but successfully joins Quorum
> --------------------------------------------------------------------
>
> Key: ZOOKEEPER-2745
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2745
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Environment: Ubuntu 12.04
> Reporter: Abhay Bothra
> Priority: Critical
> Attachments: ZOOKEEPER-2745.patch
>
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)