zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andor Molnar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-3182) Race condition when follower syncing with leader and starting to serve requests
Date Thu, 25 Oct 2018 20:26:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664247#comment-16664247
] 

Andor Molnar commented on ZOOKEEPER-3182:
-----------------------------------------

Got it. That was pretty much my other thinking: I wasn't sure about that the problem didn't
exist before and whether the unit test actually makes sense.
I'm not sure if we can issue sync, because reconnection / setWatches logic is automatic and
built into the client. Maybe we should just remove this test or remove the shutdown / restart
part, because it's not clear to me what exactly are we trying to test here by restarting the
follower.

> Race condition when follower syncing with leader and starting to serve requests
> -------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3182
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3182
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.6.0
>            Reporter: Andor Molnar
>            Priority: Critical
>
> This issue is probably introduced by ZOOKEEPER-2024 where 2 seperate queues have been
implemented in CommitProcessor to improve performance. [~abrahamfine] 's analysis is accurate
on GitHub: https://github.com/apache/zookeeper/pull/300
> He was trying to introduce synchronization between Learner.syncWithLeader() and CommitProcessor
to wait for in-flight requests to be committed before accepting client requests.
> In the affected unit test ({{testNodeDataChanged}}) there's a race between reconnecting
client's setWatches request and updates coming from the leader according to the following
logs:
> {noformat}
> 2018-10-25 13:59:58,556 [myid:] - DEBUG [FollowerRequestProcessor:1:CommitProcessor@424]
- Processing request:: sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
> 2018-10-25 13:59:58,556 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91]
- Processing request:: sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
> ...
> 2018-10-25 13:59:58,557 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91]
- Processing request:: sessionid:0x20005d8f8a40000 type:delete cxid:0x1 zxid:0x100000004 txntype:2
reqpath:n/a
> ...
> 2018-10-25 13:59:58,561 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91]
- Processing request:: sessionid:0x20005d8f8a40000 type:create cxid:0x2 zxid:0x100000005 txntype:1
reqpath:n/a
> 2018-10-25 13:59:58,561 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864]
- Got WatchedEvent state:SyncConnected type:NodeDeleted path:/test-changed for sessionid 0x10005d8fc4d0000
> {noformat}
> {{setWatches}} request is processed before {{delete}} and {{create}}, hence the client
receives NodeDeleted event.
> In the working scenario it looks like:
> {noformat}
> 2018-10-25 14:04:55,247 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91]
- Processing request:: sessionid:0x20005dd88110000 type:delete cxid:
> 0x1 zxid:0x100000004 txntype:2 reqpath:n/a
> 2018-10-25 14:04:55,249 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91]
- Processing request:: sessionid:0x20005dd88110000 type:create cxid:
> 0x2 zxid:0x100000005 txntype:1 reqpath:n/a
> ...
> 2018-10-25 14:04:56,314 [myid:] - DEBUG [FollowerRequestProcessor:1:CommitProcessor@424]
- Processing request:: sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
> 2018-10-25 14:04:56,315 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91]
- Processing request:: sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
> ...
> 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@842]
- Got notification sessionid:0x10005dd88110000
> 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864]
- Got WatchedEvent state:SyncConnected type:NodeDataChanged path:/test-changed for sessionid
0x10005dd88110000
> {noformat}
> {{delete}} and {{create}} requests happen way before {{setWatches}} comes in (even before
the client connection is established) and client receives NodeDataChanged event only.
> Abe's approach unfortunately raises the following concerns:
> - modifies CommitProcessor's code which might affect performance and correctness ([~shralex]
raised on ZOOKEEPER-2807),
> - we experienced deadlocks while testing the patch: https://github.com/apache/zookeeper/pull/300
> As a consequence I raised this Jira to capture the experiences and to put the unit test
on Ignore list, because currently I'm not sure about whether this is a real issue or a non-backward
compatible change in 3.6 with the gain of a huge performance improvement.
> Either way I don't want this flaky test to influence contributions, so I'll mark as Ignored
on trunk until the issue is resolved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message