zookeeper-dev mailing list archives

From "Beom Heyn Kim (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ZOOKEEPER-2832) Data Inconsistency occurs if follower has uncommitted transaction in the log while synchronizing with the leader that has the lower last processed zxid
Date Sun, 02 Jul 2017 19:38:01 GMT
Beom Heyn Kim created ZOOKEEPER-2832:
----------------------------------------

             Summary: Data Inconsistency occurs if follower has uncommitted transaction in
the log while synchronizing with the leader that has the lower last processed zxid
                 Key: ZOOKEEPER-2832
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2832
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum
    Affects Versions: 3.4.9
            Reporter: Beom Heyn Kim
             Fix For: 3.4.10


Synchronization code may fail to truncate an uncommitted transaction in the follower’s transaction
log. Here is a scenario:
 
Initial condition:
Start the ensemble with three nodes A, B, and C, with C being the leader.
The current epoch is 1.
For simplicity of the example, let's say the zxid is a two-digit number, with the epoch being
the first digit.
Create two znodes 'key0' and 'key1' whose values are '0' and '1', respectively (see the client
sketch right after this block).
The last zxid is 12 -- 11 for creating key0 and 12 for creating key1. (For simplicity of the
example, the zxid is increased only by transactions that directly change the data of znodes.)
All the nodes have seen the change 12 and have persistently logged it.
Shut down all nodes.
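
For concreteness, the initial condition above corresponds to a minimal client sketch like the
following; the connect string and class name are hypothetical, and the watcher is a no-op.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class InitialConditionSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical connect string for the three-node ensemble A, B, C.
            ZooKeeper zk = new ZooKeeper("hostA:2181,hostB:2181,hostC:2181", 30000, event -> { });

            // zxid 11: create key0 with value "0"
            zk.create("/key0", "0".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            // zxid 12: create key1 with value "1"
            zk.create("/key1", "1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            zk.close();
        }
    }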
 
Step 1
Start Node A and B. Epoch becomes 2. Then a setData(key0, 1000) request with zxid 21 is issued.
The leader B writes it to its log, but Node A is shut down before writing it to its log. Then
the leader B is also shut down. The change 21 is thus logged only on B, not on A or C.
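
The request in this step is an ordinary client write; a minimal sketch with a hypothetical
connect string and class name, issued while only A and B are up:

    import org.apache.zookeeper.ZooKeeper;

    public class Step1WriteSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical connect string; at this point only A and B are running.
            ZooKeeper zk = new ZooKeeper("hostA:2181,hostB:2181", 30000, event -> { });
            // This request becomes zxid 21; version -1 skips the expected-version check.
            // The leader B logs the proposal; A is shut down before it can log it.
            zk.setData("/key0", "1000".getBytes(), -1);
            zk.close();
        }
    }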
 
Step 2
Start Node A and C. Epoch becomes 3. Node A has a higher zxid than Node C (i.e. 20 > 12), so
Node A becomes the leader. Yet the last processed zxid is 12 for both Node A and Node C, so
they are already in sync, and Node A sends an empty DIFF to Node C. Node C takes a snapshot
and creates snapshot.12. Then A and C are shut down. Now C has a higher zxid than Node B.
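
The "empty DIFF" behavior in this step comes from the leader-side sync decision in
LearnerHandler. The following is a simplified, self-contained sketch of that decision, not the
actual 3.4.x source; the point is that equal last processed zxids alone are treated as
"in sync", without looking at what the follower's log may contain beyond that zxid.

    // Illustrative sketch only, not the actual LearnerHandler code.
    public class EmptyDiffSketch {
        static String decide(long leaderLastZxid, long peerLastZxid) {
            // Equal last processed zxids => follower looks fully caught up,
            // so the leader queues an (empty) DIFF and sends nothing else.
            return peerLastZxid == leaderLastZxid ? "empty DIFF" : "DIFF/TRUNC/SNAP";
        }

        public static void main(String[] args) {
            // Step 2: leader A and follower C both report 12, so C gets an empty DIFF.
            System.out.println(decide(0x12, 0x12));
        }
    }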
 
Step 3
Start Node B and C. Epoch becomes 4. Node C has a higher zxid than Node B (i.e. 30 > 21), so
Node C becomes the leader. Node B and Node C have different last processed zxids (21 vs. 12),
and the LinkedList object 'proposals' is empty, so Node C sends SNAP to Node B. Node B takes a
clean snapshot and creates snapshot.12, since zxid 12 is the last processed zxid of the leader
C. (Note that the newly created snapshot on B is assigned a lower zxid than the change 21 in
its log.) Then a setData(key1, 1001) request with zxid 41 is issued. Both B and C write the
change 41 to their logs. (Note that B and C now have the same last processed zxid.) Then B and
C are shut down.
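
Why C sends SNAP rather than TRUNC here: with an empty committed-proposal cache and differing
zxids, the leader has nothing to diff against and falls back to a full snapshot transfer, and
no TRUNC is ever considered. A simplified, self-contained sketch under that assumption
(illustrative names, not the actual 3.4.x code):

    public class SnapFallbackSketch {
        static String decide(long leaderLastZxid, long peerLastZxid, boolean proposalsEmpty) {
            if (peerLastZxid == leaderLastZxid) {
                return "empty DIFF";
            }
            if (proposalsEmpty) {
                // Full snapshot transfer; the follower serializes snapshot.12 but is
                // never told to truncate the uncommitted change 21 still in its log.
                return "SNAP";
            }
            return "DIFF or TRUNC, decided by walking the cached proposals";
        }

        public static void main(String[] args) {
            // Step 3: leader C is at 12, follower B reports 21, proposals is empty.
            System.out.println(decide(0x12, 0x21, true));   // prints SNAP
        }
    }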
 
Step 4
Start Node B and C. Epoch becomes 5. Node B and C use their local log and snapshot files to
restore their in-memory data trees. Node B has 1000 as the value of key0, because its latest
valid snapshot is snapshot.12 and there is a later transaction with zxid 21 in its log. Node C,
however, has 0 as the value of key0, because the change 21 was never written on C. Node C is
the leader. Node B and C have the same last processed zxid, i.e. 41, so they are considered to
be in sync already, and Node C sends an empty DIFF to Node B. The synchronization therefore
completes with the initially restored in-memory data trees on B and C.
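
The divergent restore in this step follows from how a server rebuilds its data tree at startup:
it deserializes the newest valid snapshot and then replays every log transaction with a higher
zxid, committed or not. A minimal sketch using the 3.4.x persistence classes, with hypothetical
directories:

    import java.io.File;
    import java.util.HashMap;

    import org.apache.zookeeper.server.DataTree;
    import org.apache.zookeeper.server.persistence.FileTxnSnapLog;

    public class RestoreSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical data and snapshot directories for node B.
            FileTxnSnapLog snapLog = new FileTxnSnapLog(
                    new File("/data/zookeeper/log"), new File("/data/zookeeper/snap"));
            DataTree dataTree = new DataTree();

            // Loads the newest valid snapshot (snapshot.12 on B), then replays
            // every later log transaction -- including the uncommitted change 21.
            long lastZxid = snapLog.restore(dataTree, new HashMap<Long, Integer>(),
                    (hdr, rec) -> { /* no-op playback listener */ });

            System.out.println("restored up to zxid 0x" + Long.toHexString(lastZxid));
        }
    }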
 
Problem
The value of key0 on B is 1000, while the value of key0 on Node C is 0. LearnerHandler.run on C
at Step 3 never sends TRUNC, only SNAP, so the change 21 is never truncated on B. Also, at
Step 4, since B uses the snapshot with the lower zxid to restore its in-memory data tree, the
change 21 gets into the data tree. The leader C at Step 4 then does not send SNAP, because the
change 41 made on both B and C makes the leader C think B and C are already in sync. Thus,
data inconsistency occurs.
 
The attached test case can deterministically reproduce the bug.
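
Independently of the attached test, the divergence can be observed by reading /key0 from each
server after Step 4; a minimal sketch with hypothetical addresses:

    import org.apache.zookeeper.ZooKeeper;

    public class CompareReplicasSketch {
        public static void main(String[] args) throws Exception {
            // Connect to B and C individually (hypothetical addresses).
            ZooKeeper zkB = new ZooKeeper("hostB:2181", 30000, event -> { });
            ZooKeeper zkC = new ZooKeeper("hostC:2181", 30000, event -> { });

            String onB = new String(zkB.getData("/key0", false, null));  // expected "1000"
            String onC = new String(zkC.getData("/key0", false, null));  // expected "0"
            System.out.println("B sees " + onB + ", C sees " + onC);

            zkB.close();
            zkC.close();
        }
    }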




