hadoop-zookeeper-dev mailing list archives

From "Vishal K (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever
Date Mon, 08 Nov 2010 15:16:08 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929589#action_12929589 ]

Vishal K commented on ZOOKEEPER-914:

Hi Flavio,

You are right. Sorry, my comment was not fair.

Regarding SO_TIMEOUT: Per my understanding, SO_TIMEOUT works only when a channel is set in
non-blocking mode using configureBlocking(false). If the channel is not configured to work in
non-blocking mode, setting SO_TIMEOUT has no effect. Please let me know if you think there
is a way to set a timeout on the socket after accepting the connection (without configuring
the channel in non-blocking mode). The only way I know to use SO_TIMEOUT is by calling
channel.configureBlocking(false).
The current code in QuorumCnxManager assumes use of blocking IO. We will have to handle partial
reads/writes. Please refer to my earlier question regarding SO_TIMEOUT for implementing non-blocking
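
To make the partial-read concern concrete, here is a minimal sketch of a bounded read on a non-blocking channel that waits on a Selector with a deadline instead of relying on SO_TIMEOUT. The class name TimedChannelRead and the method readFully are hypothetical and are not part of QuorumCnxManager; the sketch also assumes the caller is fine with the channel being left in non-blocking mode.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Hypothetical helper, for illustration only.
final class TimedChannelRead {
    /**
     * Fills buf from ch, waiting at most timeoutMs in total.
     * Returns true if buf was filled, false if the deadline passed first.
     */
    static boolean readFully(SocketChannel ch, ByteBuffer buf, long timeoutMs)
            throws IOException {
        ch.configureBlocking(false); // required before registering with a Selector
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        try (Selector selector = Selector.open()) {
            ch.register(selector, SelectionKey.OP_READ);
            while (buf.hasRemaining()) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    return false; // deadline passed with the buffer still incomplete
                }
                if (selector.select(remainingMs) == 0) {
                    continue; // nothing readable yet; the loop rechecks the deadline
                }
                selector.selectedKeys().clear();
                // A single read may return fewer bytes than requested (partial read),
                // so keep looping until the buffer is full.
                if (ch.read(buf) < 0) {
                    throw new EOFException("peer closed before sending all bytes");
                }
            }
            return true;
        }
    }
}
```

A caller would pass the accepted channel and a buffer sized for the expected message, and close the channel when readFully returns false. Writes would need the same treatment with OP_WRITE.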

I thought this fix was supposed to go in for 3.3.3. As I suggested earlier, one quick fix
to the problem is to use a TimerTask. Before doing blocking IO we can start a timer for that
channel (in receiveConnection() before the read). Once the timer expires, check if the read() has
finished. If not, interrupt and close the channel. I think having such a fix (or some other
fix that will get around the problem) until the real fix is in is a better approach. Let me
know what you think.
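
A rough sketch of the timer idea, assuming a blocking SocketChannel: a TimerTask closes the channel if the read has not finished by the deadline, which makes the blocked read() fail with AsynchronousCloseException. The names ReadWatchdog and readWithWatchdog are hypothetical, and this version only closes the channel (closing is enough to unblock the reader, so no separate interrupt is issued).

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.SocketChannel;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper, for illustration only.
final class ReadWatchdog {
    /**
     * Blocking read that fills buf, guarded by a timer.
     * Returns true if buf was filled; returns false if the watchdog
     * closed the channel because the read did not finish in time.
     */
    static boolean readWithWatchdog(final SocketChannel ch, ByteBuffer buf,
                                    long timeoutMs, Timer timer) throws IOException {
        final AtomicBoolean done = new AtomicBoolean(false);
        TimerTask watchdog = new TimerTask() {
            @Override
            public void run() {
                // Timer expired: if the read is still pending, close the channel.
                // Closing from another thread unblocks a thread stuck in read().
                if (!done.get()) {
                    try {
                        ch.close();
                    } catch (IOException ignored) {
                        // nothing useful to do in a sketch
                    }
                }
            }
        };
        timer.schedule(watchdog, timeoutMs);
        try {
            while (buf.hasRemaining()) {
                if (ch.read(buf) < 0) {
                    throw new EOFException("peer closed before sending all bytes");
                }
            }
            return true;
        } catch (ClosedChannelException e) {
            // Covers AsynchronousCloseException raised when the watchdog closes ch.
            return false;
        } finally {
            done.set(true);
            watchdog.cancel();
        }
    }
}
```

There is a small race between the done check and the close: a read that completes just as the timer fires can still have its channel closed. For a stopgap that only needs to keep the listener thread from hanging, that is probably acceptable.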

If we decide to go with one of the quick fixes, then we can use this JIRA for that and use ZOOKEEPER-900
for the real fix. Otherwise, as you suggested, we can close this JIRA and use ZOOKEEPER-900.


> QuorumCnxManager blocks forever 
> --------------------------------
>                 Key: ZOOKEEPER-914
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection
>            Reporter: Vishal K
>            Assignee: Vishal K
>            Priority: Blocker
>             Fix For: 3.3.3, 3.4.0
> This was a disaster. While testing our application we ran into a scenario where a rebooted
> follower could not join the cluster. Further debugging showed that the follower could not
> join because the QuorumCnxManager on the leader was blocked for an indefinite amount of time
> in receiveConnection():
> "Thread-3" prio=10 tid=0x00007fa920005800 nid=0x11bb runnable [0x00007fa9275ed000]
>    java.lang.Thread.State: RUNNABLE
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>     - locked <0x00007fa93315f988> (a java.lang.Object)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
> I had pointed out this bug along with several other problems in QuorumCnxManager earlier in
> https://issues.apache.org/jira/browse/ZOOKEEPER-900 and https://issues.apache.org/jira/browse/ZOOKEEPER-822.
> I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix and a patch
> will be out soon.
> The problem is that QuorumCnxManager is using SocketChannel in blocking mode. It does
> a read() in receiveConnection() and a write() in initiateConnection().
> Sorry, but this is really bad programming. It also points to a lack of failure tests
> for QuorumCnxManager.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.