zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Botond Hejj (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ZOOKEEPER-3072) Race condition in throttling
Date Fri, 29 Jun 2018 14:01:00 GMT
Botond Hejj created ZOOKEEPER-3072:

             Summary: Race condition in throttling
                 Key: ZOOKEEPER-3072
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3072
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.5.4, 3.5.3, 3.5.2, 3.5.1, 3.5.0
            Reporter: Botond Hejj

There is a race condition in the server throttling code. It is possible that the disableRecv
is called after enableRecv.

Basically, the I/O work thread does this in processPacket: [https://github.com/apache/zookeeper/blob/release-3.5.3/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java#L1102] 







incrOutstandingRequests() checks for limit breach, and potentially turns on throttling, [https://github.com/apache/zookeeper/blob/release-3.5.3/src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java#L384]


submitRequest() will create a logical request and en-queue it so that Processor thread can
pick it up. After being de-queued by Processor thread, it does necessary handling, and then
calls this [https://github.com/apache/zookeeper/blob/release-3.5.3/src/java/main/org/apache/zookeeper/server/FinalRequestProcessor.java#L459] :


            cnxn.sendResponse(hdr, rsp, "response");


and in sendResponse(), it first appends to outgoing buffer, and then checks if un-throttle
is needed:  [https://github.com/apache/zookeeper/blob/release-3.5.3/src/java/main/org/apache/zookeeper/server/NIOServerCnxn.java#L708]


However, if there is a context switch between submitRequest() and cnxn.incrOutstandingRequests(),
so that Processor thread completes cnxn.sendResponse() call before I/O thread switches back,
then enableRecv() will happen before disableRecv(), and enableRecv() will fail the CAS ops,
while disableRecv() will succeed, resulting in a deadlock: un-throttle is needed for letting
in requests, and sendResponse is needed to trigger un-throttle, but sendResponse() requires
an incoming message. From that point on, ZK server will no longer select the affected client
socket for read, leading to the observed client-side failure in the subject.

If you would like to reproduce this than setting the globalOutstandingLimit down to 1 makes
this reproducible easier as throttling starts with less requests. 


This message was sent by Atlassian JIRA

View raw message