zookeeper-dev mailing list archives

From "Michael Han (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2745) Node loses data after disk-full event, but successfully joins Quorum
Date Fri, 14 Apr 2017 15:31:41 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969153#comment-15969153 ]

Michael Han commented on ZOOKEEPER-2745:

Thanks for reporting this with all details!

bq. When restoring, zookeeper doesn't find any valid snapshot logs to restore from. So it
tries to reload its state from txn logs starting at zxid 0.

With ZOOKEEPER-2325 this should not happen. If there is no valid snapshot but some txn
log files exist, the ZK process will abort.

Regarding retaining valid snapshots vs. recent snapshots: from an operational point of view
it might be good to retain a larger number of snapshots than the default value (3), because
during the recovery process ZK will scan up to 100 snapshots and skip invalid ones until it
hits a valid snapshot. Because of the way purge works, the txn log file that pertains to the
first valid snapshot should also persist, so ZK can recover from the first valid snapshot and
its txn log.
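
As a rough illustration of the recovery scan described above, here is a minimal Python
sketch (not ZooKeeper's actual Java code). The names zxid_of, newest_valid_snapshot and
the is_valid callback are hypothetical; the sketch assumes file names of the form
snapshot.<zxid> with a hexadecimal zxid suffix, and takes the limit of 100 candidates
from the description above.

```python
from typing import Callable, Iterable, Optional


def zxid_of(name: str) -> int:
    """Parse the zxid suffix of a name like 'snapshot.950' (assumed hexadecimal)."""
    return int(name.split(".", 1)[1], 16)


def newest_valid_snapshot(
    names: Iterable[str],
    is_valid: Callable[[str], bool],  # hypothetical validity check supplied by the caller
    max_candidates: int = 100,
) -> Optional[str]:
    """Walk snapshots from newest to oldest and return the first valid one."""
    snapshots = sorted(
        (n for n in names if n.startswith("snapshot.")),
        key=zxid_of,
        reverse=True,
    )
    for name in snapshots[:max_candidates]:
        if is_valid(name):
            return name  # invalid, newer snapshots were skipped on the way here
    return None  # no valid snapshot among the scanned candidates
```

With the files from the report, the scan would skip snapshot.950, snapshot.920 and
snapshot.850 and settle on snapshot.800, provided snapshot.800 has not been purged.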

The current approach of retaining the most recent snapshots also has a benefit from a forensic
point of view: if something disastrous happened that required administrator intervention,
we might want to understand what happened, and retaining the most recent snapshots can provide
more context. If we only retained valid snapshots, we would not know what had happened.
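
To make the two retention policies concrete, here is a small Python sketch that contrasts
them. It is only an illustration under stated assumptions: retain_recent, retain_valid and
the is_valid callback are hypothetical names, and zxid suffixes are assumed to be
hexadecimal.

```python
from typing import Callable, Iterable, List


def zxid_of(name: str) -> int:
    """Parse the zxid suffix of a name like 'snapshot.950' (assumed hexadecimal)."""
    return int(name.split(".", 1)[1], 16)


def retain_recent(snapshots: Iterable[str], count: int) -> List[str]:
    """Keep the `count` newest snapshots, valid or not (the current behaviour described above)."""
    return sorted(snapshots, key=zxid_of, reverse=True)[:count]


def retain_valid(
    snapshots: Iterable[str],
    is_valid: Callable[[str], bool],  # hypothetical validity check
    count: int,
) -> List[str]:
    """Keep the `count` newest snapshots that pass the validity check."""
    newest_first = sorted(snapshots, key=zxid_of, reverse=True)
    return [s for s in newest_first if is_valid(s)][:count]
```

For the scenario in the report, retain_recent with count 3 keeps only the three truncated
snapshots, while retain_valid keeps snapshot.800 but discards the truncated files that
would be useful for a post-mortem.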

My recommendations for similar cases are:
* Retain a larger number of snapshots. It is rare for all of the snapshots to be invalid.
* Apply the fix in ZOOKEEPER-2325: in case all snapshots are invalid, prevent restoring from
txn logs alone, which leads to a bad state.
* Apply the fix in ZOOKEEPER-261: to prevent a bad state from propagating to the entire quorum.
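
The second recommendation can be sketched as a startup guard. This is a minimal Python
illustration of the behaviour described above (abort when txn logs exist but no valid
snapshot does), not the actual ZOOKEEPER-2325 patch. NoValidSnapshotError,
choose_restore_point and is_valid are hypothetical names, and zxid suffixes are assumed
to be hexadecimal.

```python
from typing import Callable, Iterable, Optional


class NoValidSnapshotError(RuntimeError):
    """Raised when txn logs exist but there is no valid snapshot to replay them onto."""


def zxid_of(name: str) -> int:
    """Parse the zxid suffix of a name like 'snapshot.950' (assumed hexadecimal)."""
    return int(name.split(".", 1)[1], 16)


def choose_restore_point(
    names: Iterable[str],
    is_valid: Callable[[str], bool],  # hypothetical validity check
) -> Optional[str]:
    """Return the newest valid snapshot, or refuse to start from txn logs alone."""
    names = list(names)
    snapshots = sorted(
        (n for n in names if n.startswith("snapshot.")),
        key=zxid_of,
        reverse=True,
    )
    for name in snapshots:
        if is_valid(name):
            return name
    if any(n.startswith("log.") for n in names):
        # Replaying a partial txn log from an empty tree would silently lose data.
        raise NoValidSnapshotError("txn logs present but no valid snapshot found")
    return None  # no txn logs either: nothing to replay in this sketch
```

In the reported scenario, the fourth restart would stop with an error instead of
rebuilding partial state from log.951.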

> Node loses data after disk-full event, but successfully joins Quorum
> --------------------------------------------------------------------
>                 Key: ZOOKEEPER-2745
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2745
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6
>         Environment: Ubuntu 12.04
>            Reporter: Abhay Bothra
>            Priority: Critical
>         Attachments: ZOOKEEPER-2745.patch
> If disk is full on 1 zookeeper node in a 3 node ensemble, it is able to join the quorum
with partial data.
> Setup:
> --------
> - Running a 3 node zookeeper ensemble on Ubuntu 12.04 as upstart services. Let's call
the nodes: A, B and C.
> Observation:
> -----------------
> - Connecting to 2 (Node A and B) of the 3 nodes and doing an `ls` in zookeeper data directory
was giving:
> /foo
> /bar
> /baz
> But an `ls` on node C was giving:
> /baz
> - On node C, the zookeeper data directory had the following files:
> log.1001
> log.1600
> snapshot.1000 -> size 200
> snapshot.1200 -> size 269
> snapshot.1300 -> size 300
> - Snapshot sizes on node A and B were in the vicinity of 500KB
> -------
> - Disk was full on node C prior to the creation time of the small snapshot
>   files.
> - Looking at zookeeper server logs, we observed that zookeeper had crashed and restarted
a few times after the first instance of disk full. Every time zookeeper starts, it does
3 things:
>   1. Run the purge task to cleanup old snapshot and txn logs. Our
>   autopurge.snapRetainCount is set to 3.
>   2. Restore from the most recent valid snapshot and the txn logs that follow.
>   3. Take part in a leader election - realize it has missed something - become the follower
- get diff of missed txns from the current leader - create a new snapshot of its current state.
> - We confirmed that a valid snapshot of the system had existed prior to, and
>   immediately after the crash. Let's call this snapshot snapshot.800.
> - Over the next 3 restarts, zookeeper did the following:
>   - Purged older snapshots
>   - Restored from snapshot.800 + txn logs
>   - Synced up with master, tried to write its updated state to a new snapshot. Crashed
due to disk full. The snapshot file, even though invalid, had been created.
> - *Note*: This is the first source of the bug. It might be more appropriate to first
write the snapshot to a temporary file, and then rename it to
> snapshot.<txn_id>. That would give us more confidence in the validity of snapshots
in the data dir. 
> - Let's say the snapshot files created above were snapshot.850, snapshot.920 and snapshot.950
> - On the 4th restart, the purge task retained the 3 recent snapshots - snapshot.850,
snapshot.920, and snapshot.950, and proceeded to purge snapshot.800 and associated txn logs
assuming that they were no longer needed.
> - *Note*: This is the second source of the bug. Instead of retaining the 3 most recent
*valid* snapshots, the server just retains 3 most recent snapshots, regardless of their validity.
> - When restoring, zookeeper doesn't find any valid snapshot logs to restore from. So
it tries to reload its state from txn logs starting at zxid 0. However, those transaction
logs would have long ago been garbage collected. It reloads from whatever txn logs are present.
Let's say the only txn log file present (log.951) contains logs for zxid 951 to 998.  It reloads
from that log file, syncs with master - gets txns 999 and 1000, and writes the snapshot log
snapshot.1000 to disk. Now that we have deleted snapshot.800, we have enough free disk space
to write snapshot.1000. From this state onwards, zookeeper will always assume it has the state
till txn id 1000, even though it only has state from txn id 951 to 1000.
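
The reporter's first note (write the snapshot to a temporary file, then rename it) can be
sketched as follows. This is a minimal Python illustration of the general
write-then-rename pattern, not ZooKeeper code or the attached patch; the function name
write_snapshot_atomically and the temporary file naming are assumptions, and the zxid is
formatted as hexadecimal in the file name.

```python
import os
import tempfile


def write_snapshot_atomically(data_dir: str, zxid: int, payload: bytes) -> str:
    """Write payload to a temp file in data_dir, then rename it to snapshot.<zxid>."""
    final_path = os.path.join(data_dir, f"snapshot.{zxid:x}")
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(prefix=".snapshot-", suffix=".tmp", dir=data_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, final_path)  # replaces the target in a single step
    except BaseException:
        # On failure (for example, disk full) leave no snapshot.<zxid> behind.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    return final_path
```

With this pattern, a write that fails part-way leaves at most a stray temporary file, so a
file named snapshot.<zxid> only appears once its contents have been fully written.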

This message was sent by Atlassian JIRA
