zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Beom Heyn Kim (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ZOOKEEPER-2946) The truncate() function in FileTxnLog.java may fail to properly remove an uncommitted write resulting in data inconsistency
Date Mon, 27 Nov 2017 19:30:00 GMT

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Beom Heyn Kim updated ZOOKEEPER-2946:
-------------------------------------
    Attachment: zookeeper-2946.patch

When the follower receives TRUNC from the leader, it is possible that there is no transaction
log at the given zxid (it is possible the given zxid is captured by snapshot but not by the
transaction log). It still needs to delete the transactions that are not committed and newer
than the zxid sent by the leader. The patch tries to check if it is currently considering
the file that is newer than the zxid and try to delete it before the while loop in the function
truncate(long zxid) in the FileTxnLog.java.

> The truncate() function in FileTxnLog.java may fail to properly remove an uncommitted
write resulting in data inconsistency
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2946
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2946
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.11, 3.4.12
>            Reporter: Beom Heyn Kim
>         Attachments: zookeeper-2946.patch
>
>
> The truncate() function in FileTxnLog.java may fail to properly remove the uncommitted
write. This happens when the follower that has the uncommitted writes tries to resync with
the leader after a few epochs have past. The failure results in data inconsistency in the
in-memory data tree across nodes. Here is one procedure to reproduce the inconsistency. 
> Initially:
> # Start the ensemble with three nodes: node 0, 1 and 2 (the node 2 is the leader)
> # Create 5 znodes with initial values as follow (key = value)
> {noformat}
> /testDivergenceResync0 = 0
> /testDivergenceResync1 = 1
> /testDivergenceResync2 = 2
> /testDivergenceResync3 = 3
> /testDivergenceResync4 = 4
> {noformat}
> To Reproduce:
> # Diverge the node 2
> a. Shutdown the node 0 and 1
> b. Async setData to the node 2 writing 1000 to the key ‘/testDivergenceResync0’
> c. Shutdown the node 2
> # Restart the node 0 and 1 (let them finish with resync)
> # Diverge the node 1
> a. Shutdown the node 0
> b. Async setData to the node 1 writing 1001 to the key ‘/testDivergenceResync1’
> c. Shutdown the node 1
> # Restart the node 0, 1 and 2 (let them finish with resync)
> # Diverge the node 2
> a. Shutdown the node 0 and 1
> b. Async setData to the node 2 writing 1002 to the key ‘/testDivergenceResync2’
> c. Shutdown the node 2
> # Restart the node 0 and 2 (let them finish with resync)
> # Diverge the node 2
> a. Shutdown the node 0
> b. Async setData to the node 2 writing 1003 to the key ‘/testDivergenceResync3’
> c. Shutdown the node 2
> # Restart the node 0 and 1 (let them finish with resync)
> # Diverge the node 1
> a. Shutdown the node 0
> b. Async setData to the node 1 writing 1004 to the key ‘/testDivergenceResync4’
> c. Shutdown the node 1
> # Restart the node 0 and 2 (let them finish with resync)
> # Restart the node 1 (let it finish with resync)
> Reading each key from each node directly will give us the output:
> {noformat}
> /testDivergenceResync0 on the node 0 = 0
> /testDivergenceResync0 on the node 1 = 0
> /testDivergenceResync0 on the node 2 = 0
> /testDivergenceResync1 on the node 0 = 1001
> /testDivergenceResync1 on the node 1 = 1001
> /testDivergenceResync1 on the node 2 = 1001
> /testDivergenceResync2 on the node 0 = 2
> /testDivergenceResync2 on the node 1 = 2
> /testDivergenceResync2 on the node 2 = 2
> /testDivergenceResync3 on the node 0 = 3
> /testDivergenceResync3 on the node 1 = 3
> /testDivergenceResync3 on the node 2 = 1003
> /testDivergenceResync4 on the node 0 = 1004
> /testDivergenceResync4 on the node 1 = 1004
> /testDivergenceResync4 on the node 2 = 1004
> {noformat}
> Thus, the value of key /testDivergenceResync3 is inconsistent across nodes.
> What seems to happen:
> # At the step 7, setData (at zxid 0x400000001) writing value 1003 is committed on the
node 2.
> {panel:title=Log from the node 2:}
> ...
> 2017-11-16 03:08:14,123 [myid:2] - DEBUG [ProcessThread(sid:2 cport:-1)::CommitProcessor@174]
- Processing request:: sessionid:0x2000117327c0000 type:setData cxid:0x4 zxid:0x400000001
txntype:5 reqpath:n/a
> 2017-11-16 03:08:14,124 [myid:2] - DEBUG [ProcessThread(sid:2 cport:-1)::Leader@787]
- Proposing:: sessionid:0x2000117327c0000 type:setData cxid:0x4 zxid:0x400000001 txntype:5
reqpath:n/a
> 2017-11-16 03:08:14,124 [myid:2] - INFO  [SyncThread:2:FileTxnLog@209] - Creating new
log file: log.400000001
> 2017-11-16 03:08:14,188 [myid:2] - DEBUG [SyncThread:2:Leader@600] - Count for zxid:
0x400000001 is 1
> 2017-11-16 03:08:15,752 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:Leader@512]
- Shutting down
> 2017-11-16 03:08:15,753 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:Leader@518]
- Shutdown called
> java.lang.Exception: shutdown Leader! reason: Not sufficient followers synced, only synced
with sids: [ 2 ]
> ...
> {panel}
> # At the step 10, the node 2 is restarted and supposed to be properly resync’ed with
the node 0 which is the leader.
>     a. The node 0 sends TRUNC msg so that the node 2 can truncate the setData at zxid
0x400000001.. 
>     b. On the other hand, the node 2 tries to truncate log to get in sync with the leader
0x200000001. However, the node 2 failed to properly truncate the setData at zxid 0x400000001.
So, even if resync was finished, the value 1003 is still remained intact on the node 2 while
other nodes have value 3 for the same key.
>     c. It seems on the node 2, there is only log.100000001 and log.400000001 but no log.200000001.
This seems to cause failing to delete log.400000001 during truncate(). It looks like we will
be considering log.400000001 by the time returning from the init() of FileTxnLog.java so that
we will never execute ‘itr.logFile.delete()’ for the log.400000001. 
>    d. Then, after returning from the truncate(), loadDatabase() will be invoked and log.400000001
will be read and the setData at zxid 0x400000001 gets loaded into the in-memory data tree.
> {panel:title=Log from the node 2:}
> ...
> 2017-11-16 03:08:59,051 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:Follower@65]
- FOLLOWING - LEADER ELECTION TOOK - 215
> 2017-11-16 03:08:59,052 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:QuorumPeer$QuorumServer@184]
- Resolved hostname: 127.0.0.1 to address: /127.0.0.1
> 2017-11-16 03:08:59,125 [myid:2] - WARN  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:Learner@349]
- Truncating log to get in sync with the leader 0x200000001
> 2017-11-16 03:08:59,125 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@606]
- Created new input stream /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.100000001
> 2017-11-16 03:08:59,125 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@609]
- Created new input archive /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.100000001
> 2017-11-16 03:08:59,126 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@647]
- EOF excepton java.io.EOFException
> 2017-11-16 03:08:59,126 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@606]
- Created new input stream /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.400000001
> 2017-11-16 03:08:59,126 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@609]
- Created new input archive /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.400000001
> 2017-11-16 03:08:59,126 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileSnap@83]
- Reading snapshot /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/snapshot.200000001
> 2017-11-16 03:08:59,127 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@606]
- Created new input stream /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.100000001
> 2017-11-16 03:08:59,127 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@609]
- Created new input archive /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.100000001
> 2017-11-16 03:08:59,128 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@647]
- EOF excepton java.io.EOFException
> 2017-11-16 03:08:59,128 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@606]
- Created new input stream /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.400000001
> 2017-11-16 03:08:59,128 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@609]
- Created new input archive /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.400000001
> 2017-11-16 03:08:59,128 [myid:2] - DEBUG [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@647]
- EOF excepton java.io.EOFException
> 2017-11-16 03:08:59,131 [myid:2] - WARN  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:Learner@387]
- Got zxid 0x500000001 expected 0x1
> 2017-11-16 03:08:59,132 [myid:2] - INFO  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnSnapLog@248]
- Snapshotting: 0x500000004 to /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/snapshot.500000004
> ...
> {panel}
> {panel:title=Log from the node 0:}
> ...
> 2017-11-16 03:08:59,050 [myid:0] - INFO  [QuorumPeer[myid=0]/0:0:0:0:0:0:0:0:11221:Leader@372]
- LEADING - LEADER ELECTION TOOK - 222
> 2017-11-16 03:08:59,055 [myid:0] - INFO  [LearnerHandler-/127.0.0.1:54482:LearnerHandler@346]
- Follower sid: 2 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@266e422
> 2017-11-16 03:08:59,124 [myid:0] - INFO  [LearnerHandler-/127.0.0.1:54482:LearnerHandler@401]
- Synchronizing with Follower sid: 2 maxCommittedLog=0x500000004 minCommittedLog=0x100000001
peerLastZxid=0x400000001
> 2017-11-16 03:08:59,124 [myid:0] - DEBUG [LearnerHandler-/127.0.0.1:54482:LearnerHandler@415]
- proposal size is 14
> 2017-11-16 03:08:59,124 [myid:0] - DEBUG [LearnerHandler-/127.0.0.1:54482:LearnerHandler@418]
- Sending proposals to follower
> 2017-11-16 03:08:59,124 [myid:0] - INFO  [LearnerHandler-/127.0.0.1:54482:LearnerHandler@475]
- Sending TRUNC
> 2017-11-16 03:08:59,147 [myid:0] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxnFactory@215]
- Accepted socket connection from /127.0.0.1:55118
> 2017-11-16 03:08:59,184 [myid:0] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxn@383]
- Exception causing close of session 0x0: ZooKeeperServer not running
> 2017-11-16 03:08:59,184 [myid:0] - DEBUG [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxn@386]
- IOException stack trace
> java.io.IOException: ZooKeeperServer not running
>         at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:977)
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:257)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:226)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-11-16 03:08:59,184 [myid:0] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxn@1040]
- Closed socket connection for client /127.0.0.1:55118 (no session established for client)
> 2017-11-16 03:08:59,224 [myid:0] - INFO  [LearnerHandler-/127.0.0.1:54482:LearnerHandler@535]
- Received NEWLEADER-ACK message from 2
> 2017-11-16 03:08:59,224 [myid:0] - INFO  [QuorumPeer[myid=0]/0:0:0:0:0:0:0:0:11221:Leader@962]
- Have quorum of supporters, sids: [ 0,2 ]; starting up and setting last processed zxid: 0x600000000
> ...
> {panel}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message