zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Beom Heyn Kim (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Deleted] (ZOOKEEPER-2832) Data Inconsistency occurs if follower has uncommitted transaction in the log while synchronizing with the leader that has the lower last processed zxid
Date Sun, 02 Jul 2017 19:44:00 GMT

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Beom Heyn Kim updated ZOOKEEPER-2832:
-------------------------------------
    Comment: was deleted

(was: This test can trigger the bug.)

> Data Inconsistency occurs if follower has uncommitted transaction in the log while synchronizing
with the leader that has the lower last processed zxid
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2832
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2832
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.9
>            Reporter: Beom Heyn Kim
>             Fix For: 3.4.10
>
>         Attachments: zookeeper-2832.patch
>
>
> Synchronization code may fail to truncate an uncommitted transaction in the follower’s
transaction log. Here is a scenario:
>  
> Initial condition:
> Start the ensemble with three nodes A, B and C with C being the leader
> The current epoch is 1
> For simplicity of the example, let’s say zxid is a two digit number, with epoch being
the first digit
> Create two znodes ‘key0’ and ‘key1’ whose value is ‘0’ and ‘1’, respectively
> The zxid is 12 -- 11 for creating key0 and 12 for creating key1. (For simplicity of the
example, the zxid gets increased only by transactions directly changing the data of znodes.)
> All the nodes have seen the change 12 and have persistently logged it
> Shut down all
>  
> Step 1
> Start Node A and B. Epoch becomes 2. Then, a request, setData(key0, 1000), with zxid
21 is issued. The leader B writes it to the log but Node A is shutdown before writing it to
the log. Then, the leader B is also shut down. The change 21 is applied only to B but not
to A or C.
>  
> Step 2
> Start Node A and C. Epoch becomes 3. Node A has the higher zxid than Node C (i.e. 20
> 12). So, Node A becomes the leader. Yet, the last processed zxid is 12 for both Node
A and C. So, they are in sync already. Node A sends an empty DIFF to Node C. Node C takes
a snapshot and creates snapshot.12. Then, A and C are shut down. Now, C has the higher zxid
than Node B.
>  
> Step 3
> Start Node B and C. Epoch becomes 4. Node C has the higher zxid than Node B (i.e. 30
> 21). So, Node C becomes the leader. Node B and C has the different last processed zxid
(i.e. 21 vs 12), and the LinkedList object ‘proposals’ is empty. Thus, Node C sends SNAP
to Node B. Node B takes a clean snapshot and creates snapshot.12 as the zxid 12 is the last
processed zxid of the leader C. (Note the newly created snapshot on B is assigned the lower
zxid then the change 21 in the log). Then, the request, setData(key1, 1001), with zxid 41
is issued. Both B and C apply the change 41 into their logs. (Note that now B and C have the
same last processed zxid) Then, B and C are shut down.
>  
> Step 4
> Start Node B and C. Epoch becomes 5.  Node B and C use their local log and snapshot files
to restore their in-memory data tree. Node B has 1000 as the value of key0, because it’s
latest valid snapshot is snapshot.12 and there was a later transaction with zxid 21 in its
log. Yet, Node C has 0 as the value of key0, because the change 21 was never written on C.
Node C is the leader. Node B and C have the same last processed zxid, i.e. 41. So, they are
considered to be in sync already, and Node C sends an empty DIFF to Node B. So, the synchronization
completes with the initially restored in-memory data tree on B and C.
>  
> Problem
> The value of key0 on B is 1000, while the value of the key0 on Node C is 0. The LearnerHandler.run
on C at Step 3, 	never sends TRUNC but just SNAP. So, the change 21 was never truncated on
B. Also, at step 4, since B uses the snapshot of the lower zxid to restore its in-memory data
tree, the change 21 could get into the data tree. Then, the leader C at the step 4 did not
send SNAP, because the change 41 made to both B and C makes the leader C think the B and C
are already in sync. Thus, data inconsistency occurs.
>  
> The attached test case can deterministically reproduce the bug.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message