zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Artem Golotin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-1863) Race condition in commit processor leading to out of order request completion, xid mismatch on client.
Date Mon, 01 Aug 2016 21:07:21 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402819#comment-15402819
] 

Artem Golotin commented on ZOOKEEPER-1863:
------------------------------------------

Hi! I was able to reproduce the same issue as mentioned in [ZOOKEEPER-2151|https://issues.apache.org/jira/browse/ZOOKEEPER-2151].
Here is the output from the stat command: 
{noformat}
Zookeeper version: 3.5.1-alpha--1, built on 07/08/2016 17:08 GMT
Clients:
 /0:0:0:0:0:0:0:1:59312[0](queued=0,recved=1,sent=0)
 /10.2.29.7:34266[0](queued=0,recved=1,sent=0)
 /10.2.29.7:55996[0](queued=0,recved=1,sent=0)
 /10.2.81.3:48640[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/1/826
Received: 2199007
Sent: 935511
Connections: 4
Outstanding: 48112
Zxid: 0x100011d4e
Mode: leader
Node count: 171
{noformat}

Notice the number of outstanding connections.... 
After examining the code I found that the patch that have been submitted with this issue was
applied to the code. With my limited experience I do not believe that I will be able to reproduce
the issue, but I am saving the full stdout zookeeper output and can provide it if needed.
However, I cannot provide you with a jstack dump, because when I try to do so (even with -F
option), it generates the following exception:
{noformat}
Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the
process
{noformat}

Unfortunately, the issue seems to have happened on Sunday when I was not able to check the
health of my cluster or perform any immediate fixes. I do have several Kazoo clients + Kafka
+ Secor that are using zookeeper, and as all those are failing now with zookeeper outputting
the following logs:

{noformat}
2016-08-01 20:45:51,828 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50656
2016-08-01 20:45:51,829 - INFO  [NIOWorkerThread-4:ZooKeeperServer@931] - Client attempting
to establish new session at /10.2.81.3:50656
2016-08-01 20:45:53,723 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35862
2016-08-01 20:45:53,725 - INFO  [NIOWorkerThread-3:ZooKeeperServer@931] - Client attempting
to establish new session at /10.2.29.7:35862
2016-08-01 20:45:59,171 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50658
2016-08-01 20:45:59,172 - INFO  [NIOWorkerThread-1:ZooKeeperServer@931] - Client attempting
to establish new session at /10.2.81.3:50658
2016-08-01 20:46:00,223 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection
for client /10.2.29.7:35828 which had sessionid 0x10044f682c45bdd
2016-08-01 20:46:00,223 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection
for client /10.2.81.3:50656 which had sessionid 0x10044f682c45bdf
2016-08-01 20:46:05,013 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35898
2016-08-01 20:46:05,028 - INFO  [NIOWorkerThread-2:NIOServerCnxn@836] - Processing ruok command
from /10.2.29.7:35898
2016-08-01 20:46:05,029 - INFO  [NIOWorkerThread-2:NIOServerCnxn@1007] - Closed socket connection
for client /10.2.29.7:35898 (no session established for client)
2016-08-01 20:46:05,146 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35900
2016-08-01 20:46:05,153 - INFO  [NIOWorkerThread-4:NIOServerCnxn@836] - Processing srvr command
from /10.2.29.7:35900
2016-08-01 20:46:05,153 - INFO  [NIOWorkerThread-4:NIOServerCnxn@1007] - Closed socket connection
for client /10.2.29.7:35900 (no session established for client)
2016-08-01 20:46:05,617 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35904
2016-08-01 20:46:05,617 - INFO  [NIOWorkerThread-3:NIOServerCnxn@836] - Processing ruok command
from /10.2.29.7:35904
2016-08-01 20:46:05,618 - INFO  [NIOWorkerThread-3:NIOServerCnxn@1007] - Closed socket connection
for client /10.2.29.7:35904 (no session established for client)
2016-08-01 20:46:05,634 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35906
2016-08-01 20:46:05,634 - INFO  [NIOWorkerThread-1:NIOServerCnxn@836] - Processing srvr command
from /10.2.29.7:35906
2016-08-01 20:46:05,636 - INFO  [NIOWorkerThread-1:NIOServerCnxn@1007] - Closed socket connection
for client /10.2.29.7:35906 (no session established for client)
2016-08-01 20:46:05,644 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35908
2016-08-01 20:46:05,644 - INFO  [NIOWorkerThread-2:ZooKeeperServer@931] - Client attempting
to establish new session at /10.2.29.7:35908
2016-08-01 20:46:06,718 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50660
2016-08-01 20:46:06,719 - INFO  [NIOWorkerThread-4:ZooKeeperServer@931] - Client attempting
to establish new session at /10.2.81.3:50660
2016-08-01 20:46:10,224 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection
for client /10.2.81.3:50658 which had sessionid 0x10044f682c45be1
2016-08-01 20:46:10,225 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection
for client /10.2.29.7:35862 which had sessionid 0x10044f682c45be0
2016-08-01 20:46:13,933 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50662
2016-08-01 20:46:13,933 - INFO  [NIOWorkerThread-3:ZooKeeperServer@931] - Client attempting
to establish new session at /10.2.81.3:50662
2016-08-01 20:46:16,030 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35942
2016-08-01 20:46:16,032 - INFO  [NIOWorkerThread-1:ZooKeeperServer@931] - Client attempting
to establish new session at /10.2.29.7:35942
2016-08-01 20:46:18,646 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:57696
2016-08-01 20:46:18,647 - INFO  [NIOWorkerThread-2:ZooKeeperServer@931] - Client attempting
to establish new session at /10.2.29.7:57696
2016-08-01 20:46:20,223 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection
for client /10.2.29.7:35908 which had sessionid 0x10044f682c45be2
2016-08-01 20:46:20,224 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection
for client /10.2.81.3:50662 which had sessionid 0x10044f682c45be4
2016-08-01 20:46:20,225 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection
for client /10.2.81.3:50660 which had sessionid 0x10044f682c45be3
2016-08-01 20:46:21,255 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50664
2016-08-01 20:46:21,256 - INFO  [NIOWorkerThread-4:ZooKeeperServer@931] - Client attempting
to establish new session at /10.2.81.3:50664
{noformat}

Also, I would like to point out that I am running zookeeper as a part of kubernetes cluster,
so the issue may be caused by kubernetes internal proxy, but I do not know why that would
cause an issue...

> Race condition in commit processor leading to out of order request completion, xid mismatch
on client.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1863
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1863
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0
>            Reporter: Dutch T. Meyer
>            Assignee: Dutch T. Meyer
>            Priority: Blocker
>             Fix For: 3.5.0
>
>         Attachments: ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch,
ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, stack.17512
>
>
> In CommitProcessor.java processor, if we are at the primary request handler on line 167:
> {noformat}
>                 while (!stopped && !isWaitingForCommit() &&
>                        !isProcessingCommit() &&
>                        (request = queuedRequests.poll()) != null) {
>                     if (needCommit(request)) {
>                         nextPending.set(request);
>                     } else {
>                         sendToNextProcessor(request);
>                     }
>                 }
> {noformat}
> A request can be handled in this block and be quickly processed and completed on another
thread. If queuedRequests is empty, we then exit the block. Next, before this thread makes
any more progress, we can get 2 more requests, one get_children(say), and a sync placed on
queuedRequests for the processor. Then, if we are very unlucky, the sync request can complete
and this object's commit() routine is called (from FollowerZookeeperServer), which places
the sync request on the previously empty committedRequests queue. At that point, this thread
continues.
> We reach line 182, which is a check on sync requests.
> {noformat}
>                 if (!stopped && !isProcessingRequest() &&
>                     (request = committedRequests.poll()) != null) {
> {noformat}
> Here we are not processing any requests, because the original request has completed.
We haven't dequeued either the read or the sync request in this processor. Next, the poll
above will pull the sync request off the queue, and in the following block, the sync will
get forwarded to the next processor.
> This is a problem because the read request hasn't been forwarded yet, so requests are
now out of order.
> I've been able to reproduce this bug reliably by injecting a Thread.sleep(5000) between
the two blocks above to make the race condition far more likely, then in a client program.
> {noformat}
>         zoo_aget_children(zh, "/", 0, getchildren_cb, NULL);
>         //Wait long enough for queuedRequests to drain
>         sleep(1);
>         zoo_aget_children(zh, "/", 0, getchildren_cb, &th_ctx[0]);
>         zoo_async(zh, "/", sync_cb, &th_ctx[0]);
> {noformat}
> When this bug is triggered, 3 things can happen:
> 1) Clients will see requests complete out of order and fail on xid mismatches.
> 2) Kazoo in particular doesn't handle this runtime exception well, and can orphan outstanding
requests.
> 3) I've seen zookeeper servers deadlock, likely because the commit cannot be completed,
which can wedge the commit processor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message