zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2781) Flaky test: testClientAuthAgainstNoAuthServerWithLowerSid
Date Fri, 12 May 2017 23:31:04 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008906#comment-16008906
] 

ASF GitHub Bot commented on ZOOKEEPER-2781:
-------------------------------------------

GitHub user afine opened a pull request:

    https://github.com/apache/zookeeper/pull/251

    ZOOKEEPER-2781 Flaky test: testClientAuthAgainstNoAuthServerWithLowerSid

    The flaky test appears to be caused by a race condition in QuorumCnxManager that could
potentially prevent two servers from connecting to each other. I was able to reproduce the
issue with a debugger and a little bit of patience. It would be great if someone can share
a less contrived way to reproduce the same issue. Here is the basic order of execution required
to reproduce the issue between two peers (using lines of code from before this patch).  Point
of clarification, reaching a line means hitting but not yet executing that line (equivalent
to setting a breakpoint on that line).
    
    1. peer1 enters `startConnection` and reaches QuorumCnxManager.java:365
    2. peer0's Listener enters `handleConnection` reaches  QuorumCnxManager.java:506
    2. peer0 enters `startConnection` and reaches QuorumCnxManager.java:353
    3. peer1's Listener enters `handleConnection` and reaches QuorumCnxManager.java:483
    3. peer1 executes QuorumCnxManager.java:365 and reaches QuorumCnxManager.java:374
    4. peer0's Listener executes QuorumCnxManager.java:506 and starts a RecvWorker which stops
at QuorumCnxManager.java:1027. The Listener reaches QuorumCnxManager.java:516.
    5. peer1's Listener continues executing from QuorumCnxManager.java:483, which removes
the SendWorker and RecvWorker for its connection to peer0, and reaches QuorumCnxManager.java:493
    6. peer0's RecvWorker executes  QuorumCnxManager.java:1027, the socket had since been
closed on peer1 and we throw an exception
    ```
        [junit] 2017-05-07 14:48:11,055 [myid:] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1042]
- Connection broken for id 1, my id = 0, error =
        [junit] java.io.EOFException
        [junit]     at java.io.DataInputStream.readInt(DataInputStream.java:392)
        [junit]     at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1027)
    ```
    9. peer0 executes  QuorumCnxManager.java:353 reaching  QuorumCnxManager.java:377
    10. peer1 executes connectOne at QuorumCnxManager.java:493 and continues until it reaches
QuorumCnxManager.java:290 which completes the thread
    11. peer0's listener continues executing from QuorumCnxManager.java:516. Calls `handleConnection
` again but returns at QuorumCnxManager.java:469 since peer1 never wrote its sid to the socket.
    12. peer0 continues executing from QuorumCnxManager.java:377
    13. peer1 continues executing from QuorumCnxManager.java:374
    
    
    Both threads finish and no quorum has been formed. `testClientAuthAgainstNoAuthServerWithLowerSid`
times out
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/afine/zookeeper ZOOKEEPER-2781

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/251.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #251
    
----
commit 5f231641a9496aaf84a0b58d87f5fc365fa9b7e6
Author: Abraham Fine <afine@apache.org>
Date:   2017-05-12T20:23:55Z

    ZOOKEEPER-2781: Flaky test: testClientAuthAgainstNoAuthServerWithLowerSid

----


> Flaky test: testClientAuthAgainstNoAuthServerWithLowerSid
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-2781
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2781
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.10
>            Reporter: Abraham Fine
>            Assignee: Abraham Fine
>
> Here is an example failing job: https://builds.apache.org/job/ZooKeeper_branch34_openjdk7/1489/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message