zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From afine <...@git.apache.org>
Subject [GitHub] zookeeper pull request #251: ZOOKEEPER-2781 Flaky test: testClientAuthAgains...
Date Fri, 12 May 2017 23:30:59 GMT
GitHub user afine opened a pull request:


    ZOOKEEPER-2781 Flaky test: testClientAuthAgainstNoAuthServerWithLowerSid

    The flaky test appears to be caused by a race condition in QuorumCnxManager that could
potentially prevent two servers from connecting to each other. I was able to reproduce the
issue with a debugger and a little bit of patience. It would be great if someone can share
a less contrived way to reproduce the same issue. Here is the basic order of execution required
to reproduce the issue between two peers (using lines of code from before this patch).  Point
of clarification, reaching a line means hitting but not yet executing that line (equivalent
to setting a breakpoint on that line).
    1. peer1 enters `startConnection` and reaches QuorumCnxManager.java:365
    2. peer0's Listener enters `handleConnection` reaches  QuorumCnxManager.java:506
    2. peer0 enters `startConnection` and reaches QuorumCnxManager.java:353
    3. peer1's Listener enters `handleConnection` and reaches QuorumCnxManager.java:483
    3. peer1 executes QuorumCnxManager.java:365 and reaches QuorumCnxManager.java:374
    4. peer0's Listener executes QuorumCnxManager.java:506 and starts a RecvWorker which stops
at QuorumCnxManager.java:1027. The Listener reaches QuorumCnxManager.java:516.
    5. peer1's Listener continues executing from QuorumCnxManager.java:483, which removes
the SendWorker and RecvWorker for its connection to peer0, and reaches QuorumCnxManager.java:493
    6. peer0's RecvWorker executes  QuorumCnxManager.java:1027, the socket had since been
closed on peer1 and we throw an exception
        [junit] 2017-05-07 14:48:11,055 [myid:] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1042]
- Connection broken for id 1, my id = 0, error =
        [junit] java.io.EOFException
        [junit]     at java.io.DataInputStream.readInt(DataInputStream.java:392)
        [junit]     at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1027)
    9. peer0 executes  QuorumCnxManager.java:353 reaching  QuorumCnxManager.java:377
    10. peer1 executes connectOne at QuorumCnxManager.java:493 and continues until it reaches
QuorumCnxManager.java:290 which completes the thread
    11. peer0's listener continues executing from QuorumCnxManager.java:516. Calls `handleConnection
` again but returns at QuorumCnxManager.java:469 since peer1 never wrote its sid to the socket.
    12. peer0 continues executing from QuorumCnxManager.java:377
    13. peer1 continues executing from QuorumCnxManager.java:374
    Both threads finish and no quorum has been formed. `testClientAuthAgainstNoAuthServerWithLowerSid`
times out

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/afine/zookeeper ZOOKEEPER-2781

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #251
commit 5f231641a9496aaf84a0b58d87f5fc365fa9b7e6
Author: Abraham Fine <afine@apache.org>
Date:   2017-05-12T20:23:55Z

    ZOOKEEPER-2781: Flaky test: testClientAuthAgainstNoAuthServerWithLowerSid


If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.

View raw message