zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Fangmin Lv (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-3182) Race condition when follower syncing with leader and starting to serve requests
Date Thu, 25 Oct 2018 20:00:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664232#comment-16664232
] 

Fangmin Lv commented on ZOOKEEPER-3182:
---------------------------------------

[~andorm] I'm trying to go through this problem to see if this is a real correctness issue
or by designed behavior.

So first, let's see if we're on the same page about the designed behavior. The setWatches
request is treated as a read request in Zeus, ZK only guarantees read after write for the
same session on the same server, if it reconnects to a different server, that server might
be slow, either slow syncing with leader or slow due to GC, it may not get the commits from
leader on time, so it doesn't guarantee you'll see it unless issuing sync even the follower
it connect to is already in broadcasting state.

This seems to be a test problem, not the code, and we should check if we can issue sync before
setWatches to fix the flaky test here.

> Race condition when follower syncing with leader and starting to serve requests
> -------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3182
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3182
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.6.0
>            Reporter: Andor Molnar
>            Priority: Critical
>
> This issue is probably introduced by ZOOKEEPER-2024 where 2 seperate queues have been
implemented in CommitProcessor to improve performance. [~abrahamfine] 's analysis is accurate
on GitHub: https://github.com/apache/zookeeper/pull/300
> He was trying to introduce synchronization between Learner.syncWithLeader() and CommitProcessor
to wait for in-flight requests to be committed before accepting client requests.
> In the affected unit test ({{testNodeDataChanged}}) there's a race between reconnecting
client's setWatches request and updates coming from the leader according to the following
logs:
> {noformat}
> 2018-10-25 13:59:58,556 [myid:] - DEBUG [FollowerRequestProcessor:1:CommitProcessor@424]
- Processing request:: sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
> 2018-10-25 13:59:58,556 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91]
- Processing request:: sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
> ...
> 2018-10-25 13:59:58,557 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91]
- Processing request:: sessionid:0x20005d8f8a40000 type:delete cxid:0x1 zxid:0x100000004 txntype:2
reqpath:n/a
> ...
> 2018-10-25 13:59:58,561 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91]
- Processing request:: sessionid:0x20005d8f8a40000 type:create cxid:0x2 zxid:0x100000005 txntype:1
reqpath:n/a
> 2018-10-25 13:59:58,561 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864]
- Got WatchedEvent state:SyncConnected type:NodeDeleted path:/test-changed for sessionid 0x10005d8fc4d0000
> {noformat}
> {{setWatches}} request is processed before {{delete}} and {{create}}, hence the client
receives NodeDeleted event.
> In the working scenario it looks like:
> {noformat}
> 2018-10-25 14:04:55,247 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91]
- Processing request:: sessionid:0x20005dd88110000 type:delete cxid:
> 0x1 zxid:0x100000004 txntype:2 reqpath:n/a
> 2018-10-25 14:04:55,249 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91]
- Processing request:: sessionid:0x20005dd88110000 type:create cxid:
> 0x2 zxid:0x100000005 txntype:1 reqpath:n/a
> ...
> 2018-10-25 14:04:56,314 [myid:] - DEBUG [FollowerRequestProcessor:1:CommitProcessor@424]
- Processing request:: sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
> 2018-10-25 14:04:56,315 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91]
- Processing request:: sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
> ...
> 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@842]
- Got notification sessionid:0x10005dd88110000
> 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864]
- Got WatchedEvent state:SyncConnected type:NodeDataChanged path:/test-changed for sessionid
0x10005dd88110000
> {noformat}
> {{delete}} and {{create}} requests happen way before {{setWatches}} comes in (even before
the client connection is established) and client receives NodeDataChanged event only.
> Abe's approach unfortunately raises the following concerns:
> - modifies CommitProcessor's code which might affect performance and correctness ([~shralex]
raised on ZOOKEEPER-2807),
> - we experienced deadlocks while testing the patch: https://github.com/apache/zookeeper/pull/300
> As a consequence I raised this Jira to capture the experiences and to put the unit test
on Ignore list, because currently I'm not sure about whether this is a real issue or a non-backward
compatible change in 3.6 with the gain of a huge performance improvement.
> Either way I don't want this flaky test to influence contributions, so I'll mark as Ignored
on trunk until the issue is resolved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message