zookeeper-dev mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2845) Data inconsistency issue due to retain database in leader election
Date Fri, 09 Feb 2018 16:37:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358627#comment-16358627 ]

Robert Joseph Evans commented on ZOOKEEPER-2845:


Perhaps I don't understand the issue well enough, which is totally possible because I am not
a frequent contributor and the path through all of the request processors is kind of complex.

My understanding is that the SyncRequestProcessor handles writing out edits to the edit log
and snapshots, though there are a few other places where this happens at startup. The SyncRequestProcessor
writes out edits as they arrive and flushes them to disk periodically in batches. It also
takes snapshots periodically.
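
To make the batching described above concrete, here is a toy Python model. It is not ZooKeeper code: the class `ToySyncProcessor`, its `batch_size` and `snap_every` parameters, and the list-based "disk" are all hypothetical stand-ins, and the real SyncRequestProcessor uses its own thresholds and I/O paths.

```python
class ToySyncProcessor:
    """Toy log writer that batches flushes (hypothetical, not ZooKeeper's API)."""

    def __init__(self, batch_size=3, snap_every=5):
        self.batch_size = batch_size    # assumed flush threshold
        self.snap_every = snap_every    # assumed snapshot interval, in flushed edits
        self.pending = []               # edits that arrived but are not yet flushed
        self.disk_log = []              # edits that are durable on "disk"
        self.snapshots = []             # zxid of the last edit covered by each snapshot
        self._since_snap = 0

    def append(self, zxid):
        """Accept an edit as it arrives; flush once the batch is full."""
        self.pending.append(zxid)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Make all pending edits durable, then snapshot if enough have accumulated."""
        self.disk_log.extend(self.pending)
        self._since_snap += len(self.pending)
        self.pending.clear()
        if self._since_snap >= self.snap_every:
            self.snapshots.append(self.disk_log[-1])
            self._since_snap = 0


proc = ToySyncProcessor()
for zxid in range(1, 8):
    proc.append(zxid)

print(proc.disk_log)    # edits that have been flushed
print(proc.pending)     # edits that arrived but are not durable yet
print(proc.snapshots)   # snapshot points taken so far
```

The point of the sketch is the window between `append` and `flush`: at any moment some edits may have arrived without being durable yet.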

The in memory portion appears to be updated by the FinalRequestProcessor prior to a quorum
of acks being received.

So yes there is the possibility that something is written to the transaction log that is not
applied to memory. This means that when ZKDatabase.clear() is called it should actually fast
forward the in memory changes to match those in the edit log + snapshot.

So you are saying that 
 1) proposals come in, are written to the transaction log, but the in memory database is not
updated yet.
 2) the server does a soft restart for some reason and some transactions appear to be lost
(because the in memory DB was not fast forwarded).
 3) more transactions come in (possibly conflicting with the first set of transactions).
 4) before a snapshot can happen the leader or follower restarts and has to reconstruct the
in memory DB from edits + snapshot. This would then reapply the edits that originally appeared
to be lost.
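
The four steps above can be sketched with a toy single-server model in Python. Everything here is hypothetical (the `ToyServer` class, its method names, and the dict-based state); it only illustrates how an edit that is in the log but not in memory can seem lost after a soft restart and then reappear after a full restart.

```python
class ToyServer:
    """Hypothetical model: a snapshot, an edit log, and an in-memory map."""

    def __init__(self):
        self.snapshot = {}   # state as of the last snapshot
        self.log = []        # (zxid, key, value) edits written since that snapshot
        self.memory = {}     # in-memory database

    def log_proposal(self, zxid, key, value):
        # Step 1: the edit reaches the log, but memory is not updated yet.
        self.log.append((zxid, key, value))

    def commit(self, zxid):
        # Apply an already-logged edit to memory.
        for z, key, value in self.log:
            if z == zxid:
                self.memory[key] = value

    def soft_restart(self):
        # Step 2: memory is retained as-is and is not fast-forwarded to the log.
        pass

    def full_restart(self):
        # Step 4: rebuild memory from the snapshot plus every logged edit.
        self.memory = dict(self.snapshot)
        for _, key, value in sorted(self.log):
            self.memory[key] = value


server = ToyServer()
server.log_proposal(1, "/n", "first")      # step 1: logged, never applied
server.soft_restart()                      # step 2: "/n" appears to be lost
server.log_proposal(2, "/m", "second")     # step 3: a later transaction
server.commit(2)
before_restart = dict(server.memory)

server.full_restart()                      # step 4: replay brings "/n" back
after_restart = dict(server.memory)

print(before_restart)
print(after_restart)
```

Before the full restart the server only knows about "/m"; after replaying the log it also has "/n", the edit that originally appeared to be lost.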

This does look like it might happen, so I will look into that as well.

But the test in [https://github.com/apache/zookeeper/pull/310] didn't appear to trigger this.
I could be wrong, because I concentrated most of my debugging on the original leader and what
was happening with it, rather than on the followers and what was happening with them. I also didn't
understand how clearing the leader's in memory database could cause an edit to be lost if the
edits are written out to disk before the in memory DB is updated. What I saw was:

 1) a bunch of edits and leader/follower restarts that didn't really do much of anything.
 2) the original leader lost a connection to the followers.
 3a) A transaction was written to the leader in memory DB but it didn't get a quorum of acks
 3b) The followers restarted and formed a new quorum
 4) The original leader timed out and joined the new quorum
 5) As part of the sync when the old leader joined the new quorum, it got a diff (not a snap),
but it had an edit that the new leader did not have, so it ended up out of sync with the others.
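
Step 5 can be illustrated with a small Python sketch of a diff-style sync. This is a deliberately naive model, not the real sync protocol, which has more cases than this. The function `naive_diff`, the integer zxids, and the edit strings are all made up for illustration, and the assumption that the old leader reports the last zxid the quorum agreed on (rather than its own extra edit) is mine.

```python
def naive_diff(leader_log, follower_last_zxid):
    """Return the leader's edits that are newer than the follower's reported zxid."""
    return [edit for edit in leader_log if edit[0] > follower_last_zxid]


# zxid 3 was applied on the old leader but never got a quorum of acks.
old_leader_log = [(1, "create /a"), (2, "create /b"), (3, "create /n")]
# The new quorum never saw zxid 3 and has moved on with its own edit.
new_leader_log = [(1, "create /a"), (2, "create /b"), (4, "create /c")]

reported_last_zxid = 2   # assumption: the old leader reports the last agreed zxid
diff = naive_diff(new_leader_log, reported_last_zxid)

# A diff only adds edits; nothing removes the extra one the old leader holds.
synced_old_leader = sorted(old_leader_log + diff)
only_on_old_leader = [e for e in synced_old_leader if e not in new_leader_log]

print(diff)
print(only_on_old_leader)
```

In this model the old leader receives the new edit but keeps "create /n", so it stays different from the rest of the ensemble even after a successful sync.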

I could see this second part happening even without my change, so I don't really understand
how clearing the database would prevent it.  My thinking was that it was a race condition
where the edits in the edit log were not flushed yet, and so they were lost when we cleared
the DB.  But I didn't confirm this.

> Data inconsistency issue due to retain database in leader election
> ------------------------------------------------------------------
>                 Key: ZOOKEEPER-2845
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Fangmin Lv
>            Assignee: Robert Joseph Evans
>            Priority: Critical
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time during leader
> election. In a ZooKeeper ensemble, it's possible that the snapshot is ahead of the txn file (due
> to a slow disk on the server, etc.), or that the txn file is ahead of the snapshot because no
> commit message has been received yet.
> If the snapshot is ahead of the txn file, the SyncRequestProcessor queue will be drained
> during shutdown, so the snapshot and txn file will be consistent before leader election happens,
> and this is not an issue.
> But if the txn file is ahead of the snapshot, the ensemble can end up with inconsistent data.
> Here is a simplified scenario that shows the issue:
> Let's say we have 3 servers in the ensemble: servers A and B are followers, C is the
> leader, and all the snapshots and txn files are up to T0:
> 1. A new request reaches leader C to create Node N, and it is converted to txn T1.
> 2. Txn T1 is synced to disk on C, but just before the proposal reaches the followers,
> A and B restart, so T1 doesn't exist on A and B.
> 3. A and B form a new quorum after the restart; let's say B is the leader.
> 4. C changes to the looking state because it doesn't have enough followers. It then syncs
> with leader B using last Zxid T0, which results in an empty diff sync.
> 5. Before C takes a snapshot, it restarts and replays the txns on disk, which include T1.
> Now C has Node N, but A and B don't have it.
> Also, I included a test case to reproduce this issue consistently.
> We have a totally different RetainDB version which avoids this issue by doing consensus
> between snapshot and txn files before leader election; we will submit it for review.
