hadoop-zookeeper-dev mailing list archives

From "Patrick Hunt (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever
Date Fri, 05 Nov 2010 00:52:41 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928441#action_12928441 ]

Patrick Hunt commented on ZOOKEEPER-914:

Hi Vishal, we do appreciate your feedback and interest. You've been doing a great job highlighting
issues and working to resolve them. Again, thanks.

We also feel your frustrations. We wish we had unlimited time and resources to develop and
test ZK; unfortunately, that's not the case. This is one of the many reasons why we brought
the project to Apache: to build community and gain the insights of developers and users such as
yourself. Is everything "done", is it all "perfect" code? No. However the source is open,
the process is open, and we hope that more contributors will sign on to working together and
making significant contributions. This doesn't have to be just new features, it very much
could be testing (code and QA), documentation and all the other bits that go into useful software.

I encourage you to bring your QA related concerns to the larger group. That's something that
should be discussed on the dev list rather than here in a jira for a specific issue. As you
can see, the primary committers work hard to address all the issues found. However, there are
just not enough of us (and we ourselves work on this in our spare time to varying degrees).
Perhaps others will feel similarly and you can work to address some of the deficiencies. I'd
*love* to see more unit tests and more system testing. If you want to make that happen I'd
do my best to support you.

Regards. (I'll let Flavio comment on the further specifics of this particular issue)

> QuorumCnxManager blocks forever 
> --------------------------------
>                 Key: ZOOKEEPER-914
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection
>            Reporter: Vishal K
>            Assignee: Vishal K
>            Priority: Blocker
>             Fix For: 3.3.3, 3.4.0
> This was a disaster. While testing our application we ran into a scenario where a rebooted
> follower could not join the cluster. Further debugging showed that the follower could not
> join because the QuorumCnxManager on the leader was blocked for an indefinite amount of time
> in receiveConnection().
> "Thread-3" prio=10 tid=0x00007fa920005800 nid=0x11bb runnable [0x00007fa9275ed000]
>    java.lang.Thread.State: RUNNABLE
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>     - locked <0x00007fa93315f988> (a java.lang.Object)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
> I had pointed out this bug along with several other problems in QuorumCnxManager earlier in
> https://issues.apache.org/jira/browse/ZOOKEEPER-900 and https://issues.apache.org/jira/browse/ZOOKEEPER-822.
> I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix and a patch
> will be out soon.
> The problem is that QuorumCnxManager is using SocketChannel in blocking mode. It does
> a read() in receiveConnection() and a write() in initiateConnection().
> Sorry, but this is really bad programming. It also points to a lack of failure tests
> for QuorumCnxManager.
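
To illustrate the failure mode described in the issue above (a blocking read() on a
SocketChannel whose peer never sends anything), here is a minimal sketch of one way to put
an upper bound on such a read. It is not the ZOOKEEPER-914 patch and does not use any
ZooKeeper code: the class BoundedRead, the methods readFully() and silentPeerTimesOut(), the
8-byte buffer and the 200 ms timeout are all hypothetical names and values chosen for the
example. The channel is switched to non-blocking mode and a Selector enforces the deadline.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical helper, not part of ZooKeeper.
final class BoundedRead {

    // Fills buf completely, or throws SocketTimeoutException once timeoutMs has elapsed.
    static void readFully(SocketChannel ch, ByteBuffer buf, long timeoutMs) throws IOException {
        ch.configureBlocking(false); // a channel must be non-blocking to use a Selector
        long deadline = System.currentTimeMillis() + timeoutMs;
        try (Selector selector = Selector.open()) {
            ch.register(selector, SelectionKey.OP_READ);
            while (buf.hasRemaining()) {
                long left = deadline - System.currentTimeMillis();
                if (left <= 0) {
                    throw new SocketTimeoutException("peer sent no data in time");
                }
                selector.select(left);           // waits at most 'left' milliseconds
                selector.selectedKeys().clear(); // reset the ready set for the next pass
                if (ch.read(buf) < 0) {
                    throw new EOFException("peer closed the connection");
                }
            }
        }
    }

    // Connects a client that never writes and reports whether the read timed out.
    static boolean silentPeerTimesOut(long timeoutMs) {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel accepted = server.accept()) {
                readFully(accepted, ByteBuffer.allocate(8), timeoutMs);
                return false; // not reached here: the client never sends 8 bytes
            } catch (SocketTimeoutException expected) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + silentPeerTimesOut(200));
    }
}
```

With a plain blocking channel.read(buf) in place of readFully(), the same silent client
would keep the accepting thread inside read() indefinitely, which is the situation the
thread dump above shows.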

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
