zookeeper-dev mailing list archives

From "Andor Molnar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2982) Re-try DNS hostname -> IP resolution
Date Wed, 21 Feb 2018 13:54:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371425#comment-16371425 ]

Andor Molnar commented on ZOOKEEPER-2982:
-----------------------------------------

[~eronwright]

I tried this on localhost by adding fake DNS names to /etc/hosts, like this:
{noformat}
127.0.0.1 one.andor.org
127.0.0.1 two.andor.org
#127.0.0.1 three.andor.org{noformat}
First, with all three entries commented out, I started the ZooKeeper nodes with the following
server config:
{noformat}
server.1=one.andor.org:2222:2223
server.2=two.andor.org:3333:3334
server.3=three.andor.org:4444:4445
{noformat}
Nodes were unable to connect because of the following resolution error:
{noformat}
2018-02-21 14:33:25,509 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer$QuorumServer@172] - Failed to resolve address: two.andor.org
java.net.UnknownHostException: two.andor.org
at java.net.InetAddress.getAllByName0(InetAddress.java:1273)
at java.net.InetAddress.getAllByName(InetAddress.java:1185)
at java.net.InetAddress.getAllByName(InetAddress.java:1119)
at java.net.InetAddress.getByName(InetAddress.java:1069)
at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:170)
at org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:726)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:686)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:720)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:919)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1171){noformat}
Similar entries keep repeating in both server logs. As far as I can see, ZK calls recreateSocketAddresses()
and re-resolves the address every time it tries to connect.

This is the case _without_ your patch.
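
The behaviour described above (resolving again on every connection attempt) can be sketched
roughly as follows. This is only an illustration, not the actual ZooKeeper code: the class
ReResolveSketch and the helper resolveNow() are hypothetical names, and only standard
java.net calls are used.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

// Hypothetical illustration; not taken from the ZooKeeper sources.
final class ReResolveSketch {

    /**
     * Builds a fresh socket address from the host name on every call,
     * so a lookup that failed earlier is attempted again.
     */
    static InetSocketAddress resolveNow(String host, int port) {
        try {
            InetAddress ip = InetAddress.getByName(host);
            return new InetSocketAddress(ip, port);
        } catch (UnknownHostException e) {
            // Keep the host name so that a later call can retry the lookup.
            return InetSocketAddress.createUnresolved(host, port);
        }
    }

    public static void main(String[] args) {
        // A caller would invoke resolveNow() before each connect attempt
        // instead of reusing an address object created at startup.
        for (int attempt = 1; attempt <= 3; attempt++) {
            InetSocketAddress addr = resolveNow("localhost", 2223);
            System.out.println("attempt " + attempt + ": " + addr
                    + " unresolved=" + addr.isUnresolved());
        }
    }
}
```

With something like this, enabling the /etc/hosts entry between two attempts is enough for a
later attempt to get a resolved address, subject to the JVM's own DNS caching settings.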

When I enabled the entries in /etc/hosts, the following error showed up in the logs:
{noformat}
2018-02-21 14:37:07,178 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@663] - Cannot open channel to 2 at election address two.andor.org/127.0.0.1:3334
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:580)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:641)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:692)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:720)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:919)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1171){noformat}
The error shows that DNS resolution succeeded (127.0.0.1) and that the connection problem is
a different one (Connection refused), which might be related to my simple test environment (no
socket has been created on the other side). The key takeaway here is that [~abrahamfine]
is probably right and the re-resolution happens properly.

I repeated the test with your patch too and the results are the same. No difference.

Maybe I'm missing something and the test might not be relevant at all, but at least it's a
little bit confusing.

[~eronwright] Would you please attach logs from running the same ensemble _without_ your patch
as well?

> Re-try DNS hostname -> IP resolution
> ------------------------------------
>
>                 Key: ZOOKEEPER-2982
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2982
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0, 3.5.1, 3.5.3
>            Reporter: Eron Wright 
>            Priority: Blocker
>             Fix For: 3.5.4, 3.6.0
>
>         Attachments: fixed.log
>
>
> ZOOKEEPER-1506 fixed a DNS resolution issue in 3.4.  Some portions of the fix haven't
> yet been ported to 3.5.
> To recap the outstanding problem in 3.5, if a given ZK server is started before all peer
> addresses are resolvable, that server may cache a negative lookup result and forever fail
> to resolve the address. For example, deploying ZK 3.5 to Kubernetes using a StatefulSet
> plus a Service (headless) may fail because the DNS records are created lazily.
> {code}
> 2018-02-18 09:11:22,583 [myid:0] - WARN  [QuorumPeer[myid=0](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@95] - Exception when following the leader
> java.net.UnknownHostException: zk-2.zk.default.svc.cluster.local
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
>         at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> {code}
> In the above example, the address `zk-2.zk.default.svc.cluster.local` was not resolvable
> when the server started, but became resolvable shortly thereafter. The server should eventually
> succeed but doesn't.
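
For reference, one way a failed lookup can stick at the JDK level is a cached
InetSocketAddress: it attempts resolution only when it is constructed, and an instance that
came out unresolved stays unresolved. Passing such an instance to Socket.connect() results
in an UnknownHostException, which would be consistent with the trace quoted above. The sketch
below is an assumption about the mechanism, not a statement about where ZooKeeper 3.5 keeps
its addresses; UnresolvedAddressSketch and refresh() are hypothetical names.

```java
import java.net.InetSocketAddress;

// Hypothetical illustration; not taken from the ZooKeeper sources.
final class UnresolvedAddressSketch {

    /**
     * Returns the address unchanged if it is already resolved; otherwise
     * builds a new instance from the host name, which retries the lookup.
     */
    static InetSocketAddress refresh(InetSocketAddress addr) {
        if (!addr.isUnresolved()) {
            return addr;
        }
        return new InetSocketAddress(addr.getHostString(), addr.getPort());
    }

    public static void main(String[] args) {
        // Simulates an address created while the name could not be resolved.
        InetSocketAddress cached = InetSocketAddress.createUnresolved("localhost", 2888);
        System.out.println("cached unresolved:    " + cached.isUnresolved());

        // The cached instance never changes state; only a new instance can resolve.
        InetSocketAddress fresh = refresh(cached);
        System.out.println("refreshed unresolved: " + fresh.isUnresolved());
    }
}
```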



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
