hadoop-zookeeper-dev mailing list archives

From "Vishal K (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever
Date Mon, 08 Nov 2010 19:24:08 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929681#action_12929681 ]

Vishal K commented on ZOOKEEPER-914:
------------------------------------

Hi Flavio,

The documentation is not clear.
SO_TIMEOUT has no effect on blocking channels. Non-blocking channels wait for the specified
timeout if nothing is available in the buffer; otherwise, read() returns whatever bytes are
currently available in the buffer. I wrote the following test to verify this. Let me know if
you know of a way to make SO_TIMEOUT work.
 
        QuorumPeer peerLeader = new QuorumPeer(peers, tmpdir[1], tmpdir[1], port[1], 3, 0, 2, 2, 2);
        QuorumCnxManager cnxManager = new QuorumCnxManager(peerLeader);
        QuorumCnxManager.Listener listener = cnxManager.listener;
        SocketChannel channel = SocketChannel.open();
        channel.socket().connect(peers.get(new Long(0)).electionAddr, 5000);
        channel.configureBlocking(false);
        channel.socket().setSoTimeout(1000);
        byte[] msgBytes = new byte[8];
        ByteBuffer msgBuffer = ByteBuffer.wrap(msgBytes);

        /**
         * Don't send any data and call read() and see how long it waits.
         */
        long begin = System.currentTimeMillis();
        channel.read(msgBuffer);
        long end = System.currentTimeMillis();
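
For comparison, one common way to bound the wait on a SocketChannel read without relying on SO_TIMEOUT is to register a non-blocking channel with a Selector and call select() with a timeout. The sketch below is not taken from the ZooKeeper code base: the class name TimedChannelRead, the helper readWithTimeout(), and the loopback server that stands in for a silent peer are all assumptions made for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class TimedChannelRead {

    /**
     * Hypothetical helper: waits at most timeoutMs (must be > 0) for the
     * channel to become readable. Returns 0 if select() returned without a
     * ready key (normally a timeout), otherwise the result of channel.read().
     * The channel must already be in non-blocking mode.
     */
    static int readWithTimeout(SocketChannel channel, ByteBuffer buf, long timeoutMs)
            throws IOException {
        try (Selector selector = Selector.open()) {
            channel.register(selector, SelectionKey.OP_READ);
            if (selector.select(timeoutMs) == 0) {
                return 0; // nothing became readable within the timeout
            }
            return channel.read(buf);
        }
    }

    public static void main(String[] args) throws IOException {
        // A loopback server that never writes stands in for the silent peer.
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (SocketChannel channel = SocketChannel.open(server.getLocalAddress())) {
                channel.configureBlocking(false);
                ByteBuffer buf = ByteBuffer.allocate(8);
                long begin = System.currentTimeMillis();
                int n = readWithTimeout(channel, buf, 1000);
                long end = System.currentTimeMillis();
                System.out.println("read returned " + n + " after " + (end - begin) + " ms");
            }
        }
    }
}
```

Opening a Selector per call keeps the sketch short; code that reads repeatedly would more likely keep one Selector open for the lifetime of the connection.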

Feel free to close duplicate bugs.

> QuorumCnxManager blocks forever 
> --------------------------------
>
>                 Key: ZOOKEEPER-914
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection
>            Reporter: Vishal K
>            Assignee: Vishal K
>            Priority: Blocker
>             Fix For: 3.3.3, 3.4.0
>
>
> This was a disaster. While testing our application we ran into a scenario where a rebooted
> follower could not join the cluster. Further debugging showed that the follower could not
> join because the QuorumCnxManager on the leader was blocked for an indefinite amount of time
> in receiveConnection():
> "Thread-3" prio=10 tid=0x00007fa920005800 nid=0x11bb runnable [0x00007fa9275ed000]
>    java.lang.Thread.State: RUNNABLE
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>     - locked <0x00007fa93315f988> (a java.lang.Object)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
> I had pointed out this bug along with several other problems in QuorumCnxManager earlier
> in https://issues.apache.org/jira/browse/ZOOKEEPER-900 and https://issues.apache.org/jira/browse/ZOOKEEPER-822.
> I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix and a patch
> will be out soon.
> The problem is that QuorumCnxManager is using SocketChannel in blocking mode. It does
> a read() in receiveConnection() and a write() in initiateConnection().
> Sorry, but this is really bad programming. It also points to a lack of failure tests
> for QuorumCnxManager.
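
To illustrate the contrast with a blocking SocketChannel read, here is a minimal sketch of a blocking read on a plain java.net.Socket, where SO_TIMEOUT is documented to apply to reads on the socket's InputStream. It is not the patch for this issue: the class name SoTimeoutRead, the helper readTimesOut(), and the loopback server that never sends anything are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutRead {

    /**
     * Hypothetical helper: performs one blocking read with SO_TIMEOUT set.
     * Returns true if the read gave up with SocketTimeoutException, false if
     * it returned (data arrived or the peer closed the connection).
     */
    static boolean readTimesOut(Socket socket, int timeoutMs) throws IOException {
        socket.setSoTimeout(timeoutMs);
        InputStream in = socket.getInputStream();
        try {
            in.read(new byte[8]);
            return false;
        } catch (SocketTimeoutException e) {
            return true; // no data arrived within timeoutMs
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // The server socket accepts the connection in its backlog but never writes.
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket socket = new Socket(loopback, server.getLocalPort())) {
            System.out.println("timed out: " + readTimesOut(socket, 1000));
        }
    }
}
```

A failure test along these lines, with a peer that connects and then stays silent, is the kind of case that would exercise a blocked receiveConnection().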

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

