zookeeper-dev mailing list archives

From "Raul Gutierrez Segales (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2202) Cluster crashes when reconfig adds an unreachable observer
Date Thu, 08 Dec 2016 01:47:58 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15730773#comment-15730773 ]

Raul Gutierrez Segales commented on ZOOKEEPER-2202:

[~hanm], [~phunt], [~shralex]: this is still hurting us in production. Could we get it reviewed
for 3.5.3, please? Thanks!

> Cluster crashes when reconfig adds an unreachable observer
> ----------------------------------------------------------
>                 Key: ZOOKEEPER-2202
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2202
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.5.0, 3.6.0
>            Reporter: Raul Gutierrez Segales
>            Assignee: Raul Gutierrez Segales
>             Fix For: 3.5.3, 3.6.0
>         Attachments: ZOOKEEPER-2202.patch
> While adding support for reconfig() in Kazoo (https://github.com/python-zk/kazoo/pull/333),
I found that the cluster can be crashed by adding an observer whose election port isn't reachable
(i.e., packets for that destination are dropped, not rejected). The connect attempt raises a SocketTimeoutException,
which brings down the PrepRequestProcessor:
> {code}
> 2015-06-02 14:37:16,473 [myid:3] - WARN  [ProcessThread(sid:3 cport:-1)::QuorumCnxManager@384]
- Cannot open channel to 100 at election address /
> java.net.SocketTimeoutException: connect timed out
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:369)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1288)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1315)
>         at org.apache.zookeeper.server.quorum.Leader.propose(Leader.java:1056)
>         at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.processRequest(ProposalRequestProcessor.java:78)
>         at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:877)
>         at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:143)
> {code}
> A simple repro can be obtained by using the code in the referenced pull request above
and using (for example) instead of a free (but closed) port in the loopback.

> I think that adding an Observer (or a Participant) that isn't currently reachable is
a valid use case (e.g., you are provisioning the machine and it isn't needed yet), so
one option might be to handle this with lower connect timeouts, though I'm not sure.
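
For illustration only, here is a minimal sketch of the "lower connect timeout" idea in plain Java. It is not ZooKeeper code and not the attached patch: the class BoundedConnect and the method tryConnect are hypothetical names, and the 500 ms timeout is an arbitrary assumption. It only shows that Socket.connect(SocketAddress, int) bounds the wait and that the SocketTimeoutException can be caught instead of propagating up the calling thread.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper for illustration; not part of ZooKeeper.
class BoundedConnect {

    /**
     * Tries to open a TCP connection to addr, waiting at most timeoutMs.
     * Returns true on success and false on any connect failure, so the
     * caller's thread never sees the exception.
     */
    static boolean tryConnect(InetSocketAddress addr, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(addr, timeoutMs);
            return true;
        } catch (SocketTimeoutException e) {
            // Packets were dropped: no answer arrived within timeoutMs.
            return false;
        } catch (IOException e) {
            // Connection refused, host unreachable, and similar failures.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Bind to an ephemeral loopback port so the demo is self-contained.
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            InetSocketAddress open =
                new InetSocketAddress(loopback, server.getLocalPort());
            System.out.println("listening port reachable: " + tryConnect(open, 500));
        }
    }
}
```

A shorter timeout only shortens the stall. Whether the connect should run on the request-processing thread at all is a separate design question that this sketch does not address.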

This message was sent by Atlassian JIRA
