hadoop-zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Flavio Junqueira (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-914) QuorumCnxManager blocks forever
Date Sun, 07 Nov 2010 15:22:06 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929345#action_12929345
] 

Flavio Junqueira commented on ZOOKEEPER-914:
--------------------------------------------

Hi Vishal, I also appreciate your contributions and your comments. I also understand your
frustration when you find issues with the code, but think that it is possibly equally frustrating
for the developer who thought that at least basic issues were covered, so please try to think
that we don't introduce bugs on purpose (at least I don't) and our review process is not perfect.


Regarding clover reports, we have agreed already that code coverage is not bulletproof, and
in fact there has been several other metrics proposed in the scientific literature, but it
does indicate that some call path including a give piece of code was exercised. It certainly
doesn't measure more complex cases, like race conditions, crashes and so on. In fact, if you
have a better way of measuring test coverage, I'd happy to hear about it.

I'm not sure if you agree, but it seems to me that we should close this jira because the technical
discussion here seems to be similar to the one of ZOOKEEPER-900. I'll try to address the concerns
you raised regardless of what will happen to this jira:

# My point about SO_TIMEOUT comes from here: http://download.oracle.com/javase/6/docs/api/java/net/Socket.html#setSoTimeout%28int%29
# I obviously prefer to go with real fixes instead of hacking, but we need to have release
3.3.2 out, and it sounded like introducing a configurable timeout would fix your problem until
the next release;
# About testing beyond the handshake, I'm not sure what you're proposing. If the blocking
calls are part of the handshake and this is what is failing for you, then this is what we
should target now, no?   

> QuorumCnxManager blocks forever 
> --------------------------------
>
>                 Key: ZOOKEEPER-914
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-914
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection
>            Reporter: Vishal K
>            Assignee: Vishal K
>            Priority: Blocker
>             Fix For: 3.3.3, 3.4.0
>
>
> This was a disaster. While testing our application we ran into a scenario where a rebooted
follower could not join the cluster. Further debugging showed that the follower could not
join because the QuorumCnxManager on the leader was blocked for indefinite amount of time
in receiveConnect()
> "Thread-3" prio=10 tid=0x00007fa920005800 nid=0x11bb runnable [0x00007fa9275ed000]
>    java.lang.Thread.State: RUNNABLE
>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>     - locked <0x00007fa93315f988> (a java.lang.Object)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:210)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:501)
> I had pointed out this bug along with several other problems in QuorumCnxManager earlier
in 
> https://issues.apache.org/jira/browse/ZOOKEEPER-900 and https://issues.apache.org/jira/browse/ZOOKEEPER-822.
> I forgot to patch this one as a part of ZOOKEEPER-822. I am working on a fix and a patch
will be out soon. 
> The problem is that QuorumCnxManager is using SocketChannel in blocking mode. It does
a read() in receiveConnection() and a write() in initiateConnection().
> Sorry, but this is really bad programming. Also, points out to lack of failure tests
for QuorumCnxManager.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message