zookeeper-dev mailing list archives

From "Abraham Fine (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution
Date Sat, 24 Feb 2018 01:39:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375213#comment-16375213 ]

Abraham Fine commented on ZOOKEEPER-2982:

[~fpj] I believe your diagnosis is correct, and I agree that [~eronwright]'s fix would solve
the problem in the case where DNS is eventually fixed. My concern with the current solution
is that it could cause us to bounce back and forth between leader election and quorum formation
when DNS stays in a bad state. For example, imagine a 3 node cluster {z1, z2, z3}. z3 is always
offline and z2 has no entry in DNS. z2 will connect to z1 and win the leader election. When
it comes time to form the quorum, z1 will be unable to follow z2 because it won't be able to
resolve z2's address.

Just spitballing here, but what if we had z1 connect to the {{remoteSocketAddress}} of the
socket created from the connection it received in {{QuorumCnxManager}}? I understand there
are some security concerns here, though I'm not sure how much they matter in practice since
they would be mitigated by Kerberos authentication. We could also do a reverse DNS lookup and
reject the connection if the reverse lookup does not match our expected hostname.
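The reverse-lookup validation could be sketched roughly as below. This is only an illustration, not ZooKeeper code: {{reverseMatches}} is a hypothetical helper, and {{getCanonicalHostName()}} is the standard JDK call that performs the reverse (PTR) lookup for an address.

```java
import java.net.InetAddress;

public class ReverseCheck {
    // Hypothetical validation: accept an inbound quorum connection only if
    // the reverse DNS name of the remote address matches the hostname we
    // expected for that peer from the static quorum configuration.
    public static boolean reverseMatches(InetAddress remote, String expectedHost) {
        // getCanonicalHostName() performs a reverse lookup on the address;
        // if the lookup fails it falls back to the textual IP, which will
        // not match a configured hostname and so the check rejects it.
        String reverse = remote.getCanonicalHostName();
        return reverse.equalsIgnoreCase(expectedHost);
    }

    public static void main(String[] args) {
        InetAddress loop = InetAddress.getLoopbackAddress();
        // Comparing an address against its own canonical name always matches.
        System.out.println(reverseMatches(loop, loop.getCanonicalHostName()));
    }
}
```

Note this only helps when the attacker cannot also control the reverse zone; with Kerberos in place it is more of a sanity check than a security boundary.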

What do you guys think?

> Re-try DNS hostname -> IP resolution
> ------------------------------------
>                 Key: ZOOKEEPER-2982
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0, 3.5.1, 3.5.3
>            Reporter: Eron Wright 
>            Priority: Blocker
>             Fix For: 3.5.4, 3.6.0
>         Attachments: 3.5.3-beta.zip, fixed.log
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix haven't
yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started before all peer
addresses are resolvable, that server may cache a negative lookup result and forever fail
to resolve the address.    For example, deploying ZK 3.5 to Kubernetes using a StatefulSet
plus a Service (headless) may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95]
- Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
>         at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not resolvable
when the server started, but became resolvable shortly thereafter.    The server should eventually
succeed but doesn't.
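The fix shape from ZOOKEEPER-1506 is to build a fresh {{InetSocketAddress}} on every connection attempt instead of caching one, so a lookup that failed at startup does not poison later attempts. A minimal sketch (the {{resolve}} helper is hypothetical, not actual ZooKeeper code):

```java
import java.net.InetSocketAddress;

public class ReResolve {
    // Hypothetical helper: construct a new InetSocketAddress for each
    // connection attempt. The constructor performs a fresh DNS lookup;
    // if resolution fails, the address is marked unresolved rather than
    // throwing, and the caller can retry later with a new instance.
    public static InetSocketAddress resolve(String host, int port) {
        return new InetSocketAddress(host, port);
    }

    public static void main(String[] args) {
        // Resolvable name: the address carries a resolved IP.
        System.out.println(resolve("localhost", 2181).isUnresolved());
        // The .invalid TLD is reserved and never resolves, so this
        // address stays unresolved until a retry succeeds.
        System.out.println(resolve("no-such-host.invalid", 2181).isUnresolved());
    }
}
```

Re-resolving per attempt is what lets a Kubernetes pod whose peer DNS records are created lazily eventually join the ensemble.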

This message was sent by Atlassian JIRA
