zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Krzysztof Rybak <krzysztof.ry...@motorolasolutions.com>
Subject Re: Zookeeper exception: Timeout while waiting for epoch from quorum
Date Thu, 27 Oct 2016 12:49:23 GMT
Information I provided wasn't complete as I didn't investigate the issue
well regarding connect() function timeouts.
Fortunately the issue seems to be resolved with changing cnxTimeout value
in configuration file.

Scenario is:

zk1(follower) on lo:1
zk2(leader) on lo:2
zk3(follower) on lo:3

step 1.

stop zk1

ifdown lo:1

step 2.

stop zk2
ifdown lo:2

step 3.

ifup lo:1

start zk1


result:
zk1 and zk3 cannot create a cluster of 2 zk instances.
When lo:2 is not down in step 2. the cluster is created (IP address of
former leader).

Observation:
When IP address of lo:2 is unreachable sock.connect() throws exception
'connect timed out' after a configured period of time. When lo:2 is up,
exception is thrown immediately with information 'connection refused': IP
is reachable, but no listener on the port. Socket timeout is set to 5000 by
default (cnxTO).

Solution:
Decreasing cnxTO by changing cnxTimeout in configuration file. Example
value is 1000, the cluster is created.

Question is: is it correct solution and is such a scenario supported by
zookeeper?

On Wed, Oct 26, 2016 at 2:57 PM, Krzysztof Rybak <krzysztof.rybak@
motorolasolutions.com> wrote:

> I'm not sure about the ports, but the configuration is below. Firewall is
> turned off, this shouldn't be a problem.
>
> Configuration file for zk1:
> dataDir=/opt/zookeeper/zk_server_{1,2,3}
> clientPort=1400{1,2,3}
>
> maxClientCnxns=0
> tickTime=2000
>
> initLimit=5
> syncLimit=2
> server.1=10.54.1.51:12001:13001
> server.2=10.54.1.52:12001:13001
> server.3=10.54.1.53:12001:13001
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> dataLogDir=/opt/zookeeper/zk_log_{1,2,3}
>
> {1,2,3} means different values for instances: 1 for zk1, 2 for zk2 and 3
> for zk3.
> 10.54.1.{51,52,53} are IP addresses on local interfaces, can be pinged
> from any machine(problem is observed on one single machine indeed).
> Increasing initLimit results with no success.
> Also changing election and quorum port results with fail
> (
> ...
> server.1=10.54.1.51:12001:13001
> server.2=10.54.1.52:12002:13002
> server.3=10.54.1.53:12003:13003
> ...
> )
>
> Scenario is:
> >>> zk1(follower)
> >>> zk2(leader)
> >>> zk3(follower)
> >>>
> >>> step 1. stop zk1
> >>> step 2. stop zk2
> >>> step 3. start zk1
> >>>
> >>> result:
> >>> zk1 and zk3 cannot create a cluster of 2 zk instances
>
> I found in a log that follower logs another time for leader election than
> new leader (log is attached):
> step 3. start of zk1:
>
> [2016-10-20 04:03:34,627] INFO Reading configuration from:
> /opt/kafka/config/zookeeper.properties.1 (org.apache.zookeeper.server.q
> uorum.QuorumPeerConfig)
>
> after around 10 seconds(!) follower logs time of leader election
> [2016-10-20 04:03:44,991] INFO FOLLOWING - LEADER ELECTION TOOK - 10255
> (org.apache.zookeeper.server.quorum.Learner)
>
> Then follower tries to connect to leader on quorum server communication
> port(12001):
>
> [2016-10-20 04:03:44,996] WARN Unexpected exception, tries=0, connecting
> to /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
> java.net.ConnectException: Connection refused
>
> After 4 tries(0,1,2,3) follower is shutdown:
>
> [2016-10-20 04:03:49,002] INFO shutdown called
> (org.apache.zookeeper.server.quorum.Learner)
> java.lang.Exception: shutdown Follower
>
> Then leader seems to resolve the election:
>
> [2016-10-20 04:03:49,764] INFO LEADING - LEADER ELECTION TOOK - 39709
> (org.apache.zookeeper.server.quorum.Leader)
>
> And starts listening on server communication port(12001).
>
> There are many logs informing that communication to zk2 fails
> (10.54.1.52). This is correct as this instance is not active.
> Before starting zk1 communication to zk1 fails ([2016-10-20
> 04:03:30,070]). This is correct as this instance is not yet active.
> In correct scenario information LEADER ELECTION TOOK for leader is near
> the log for the follower(a part of another log):
>
> [2016-10-20 02:48:47,675] INFO FOLLOWING - LEADER ELECTION TOOK - 15283
> (org.apache.zookeeper.server.quorum.Learner)
> [2016-10-20 02:48:47,690] INFO LEADING - LEADER ELECTION TOOK - 174954
> (org.apache.zookeeper.server.quorum.Leader)
>
> regards,
> Krzysztof
>
> On Tue, Oct 25, 2016 at 8:15 PM, Flavio Junqueira <fpj@apache.org> wrote:
>
>> There are two potential reasons I can think of:
>>
>> 1- The configuration is incorrect and it is pointing to the wrong port.
>> Keep in mind that the election port is different from the quorum server
>> communication port.
>> 2- There is something block traffic on the configured port between those
>> two servers.
>>
>> Could it be any of these?
>>
>> -Flavio
>>
>> > On 25 Oct 2016, at 14:33, Krzysztof Rybak <
>> krzysztof.rybak@motorolasolutions.com> wrote:
>> >
>> > Hi,
>> > first log is a little misleading as epoch exception is observed later
>> in a
>> > log and may be a consequence of lack of connection.
>> > The first exception observed in failing scenario is:
>> >
>> > [2016-10-25 07:48:41,107] WARN Unexpected exception, tries=0,
>> connecting to
>> > /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
>> > java.net.ConnectException: Connection refused
>> >        at java.net.PlainSocketImpl.socketConnect(Native Method)
>> >        at
>> > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSock
>> etImpl.java:350)
>> >        at
>> > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPl
>> ainSocketImpl.java:206)
>> >        at
>> > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocket
>> Impl.java:188)
>> >        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>> >        at java.net.Socket.connect(Socket.java:589)
>> >        at
>> > org.apache.zookeeper.server.quorum.Learner.connectToLeader(L
>> earner.java:225)
>> >        at
>> > org.apache.zookeeper.server.quorum.Follower.followLeader(Fol
>> lower.java:71)
>> >        at
>> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
>> >
>> > I think that instances are able to elect a Leader(zookeeper instance on
>> > address 10.54.1.53, leader port 12001-2888 in zookeeper examples) as
>> such
>> > information is logged before exception:
>> >
>> > [2016-10-25 07:48:41,103] INFO FOLLOWING - LEADER ELECTION TOOK - 10249
>> > (org.apache.zookeeper.server.quorum.Learner)
>> > Then follower tries to connect to leader but this fails.
>> >
>> > - If a leader is being elected but the leader isn't being able to sync
>> and
>> > establish itself, then you need to determine the reason. Is the follower
>> > being able to connect to the leader? Is the initialization taking too
>> long
>> > such that it times out? Is the socket read timing out?
>> >
>> > This seems to be an issue that follower cannot connect to a leader on
>> > leader port (12001 in my case).
>> > Exception is related to connect() method:
>> >
>> > at
>> > org.apache.zookeeper.server.quorum.Learner.connectToLeader(L
>> earner.java:225)
>> >
>> > and the exception is
>> >
>> > java.net.ConnectException: Connection refused
>> >
>> > Follower tries to connect just after the leader stops listening on a
>> leader
>> > port (12001 in my case, 2888 in zookeeper examples).
>> > I just don't know why follower starts trying to connect after around 10
>> > seconds after setting a new leader(during this 10 seconds leader
>> listens on
>> > a leader port). There is no timeout in socket connect as leader has
>> already
>> > stopped listening.
>> > This seems to be a reason, but why this happens ?
>> >
>> > I assume this scenario is supported and server should be set.
>> >
>> >> [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0,
>> connecting
>> > to
>> >> /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
>> >> java.net.ConnectException: Connection refused
>> >> at java.net.PlainSocketImpl.socketConnect(Native Method)
>> >> at
>> >> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSock
>> etImpl.java:
>> > 350)
>> >> at
>> >> java.net.AbstractPlainSocketImpl.connectToAddress(
>> > AbstractPlainSocketImpl.java:206)
>> >> at
>> >> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocket
>> Impl.java:188)
>> >> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>> >> at java.net.Socket.connect(Socket.java:589)
>> >> at
>> >> org.apache.zookeeper.server.quorum.Learner.connectToLeader(L
>> earner.java:
>> > 225)
>> >> at
>> >> org.apache.zookeeper.server.quorum.Follower.followLeader(Fol
>> lower.java:71)
>> >> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer
>> .java:786)
>> >
>> >
>> >
>> > On Mon, Oct 24, 2016 at 6:57 PM, Flavio Junqueira <fpj@apache.org>
>> wrote:
>> >
>> >> Hi Krzysztof,
>> >>
>> >> Your first message seems to indicate that a leader is being elected,
>> but
>> >> it is not being able to sync with enough followers (in your case 1
>> >> follower). I'm saying this because of this line:
>> >>
>> >>> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
>> >>
>> >> You say that the issue is timing in your second message, but it isn't
>> >> entirely clear to me why you think so. I think you need to determine
>> the
>> >> following:
>> >>
>> >> - Are the two remaining servers being able to elect a leader among
>> them?
>> >> If not, is the problem that they can't connect to each other?
>> >> - If a leader is being elected but the leader isn't being able to sync
>> and
>> >> establish itself, then you need to determine the reason. Is the
>> follower
>> >> being able to connect to the leader? Is the initialization taking too
>> long
>> >> such that it times out? Is the socket read timing out?
>> >>
>> >> -Flavio
>> >>
>> >>
>> >>> On 24 Oct 2016, at 11:37, Krzysztof Rybak <krzysztof.rybak@
>> >> motorolasolutions.com> wrote:
>> >>>
>> >>> Hi,
>> >>> I've logged some more information and the problem seems to be a timing
>> >>> issue.
>> >>> Long story short, the scenario is:
>> >>> zk1(follower)
>> >>> zk2(leader)
>> >>> zk3(follower)
>> >>>
>> >>> step 1. stop zk1
>> >>> step 2. stop zk2
>> >>> step 3. start zk1
>> >>>
>> >>> result:
>> >>> zk1 and zk3 cannot create a cluster of 2 zk instances(is such scenario
>> >>> supported by zookeeper?).
>> >>> When zk2 is started again instead of zk1, cluster is created.
>> >>>
>> >>> Observation:
>> >>> After step 3., zk3 is elected to be a leader and starts listening on
>> port
>> >>> 2888
>> >>> ( example port from example
>> >>> http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html ).
>> Listening
>> >>> takes around 10 seconds but during this time zk1 is not trying to
>> >> connect.
>> >>> zk1 tries to connect just after zk3 stops listening.
>> >>>
>> >>> a part of a log is:
>> >>> zk3 stops listening on 04:04:17(testing with netstat) and zk1 starts
>> >> trying
>> >>> to connect at 04:04:19,996 as below.
>> >>>
>> >>> [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0,
>> connecting
>> >> to
>> >>> /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
>> >>> java.net.ConnectException: Connection refused
>> >>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>> >>> at
>> >>> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSock
>> etImpl.java:
>> >> 350)
>> >>> at
>> >>> java.net.AbstractPlainSocketImpl.connectToAddress(
>> >> AbstractPlainSocketImpl.java:206)
>> >>> at
>> >>> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocket
>> Impl.java:
>> >> 188)
>> >>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>> >>> at java.net.Socket.connect(Socket.java:589)
>> >>> at
>> >>> org.apache.zookeeper.server.quorum.Learner.connectToLeader(L
>> earner.java:
>> >> 225)
>> >>> at
>> >>> org.apache.zookeeper.server.quorum.Follower.followLeader(
>> >> Follower.java:71)
>> >>> at org.apache.zookeeper.server.quorum.QuorumPeer.run(
>> >> QuorumPeer.java:786)
>> >>>
>> >>> This is strange as I couldn't reproduce this on local VM VirtualBox
>> but
>> >> on
>> >>> VMware reproduction is 100%.
>> >>>
>> >>> thanks,
>> >>> Krzysztof
>> >>>
>> >>> ---------- Forwarded message ----------
>> >>> From: Krzysztof Rybak <krzysztof.rybak@motorolasolutions.com>
>> >>> Date: Thu, Oct 20, 2016 at 4:50 PM
>> >>> Subject: Zookeeper exception: Timeout while waiting for epoch from
>> quorum
>> >>> To: user@zookeeper.apache.org
>> >>>
>> >>>
>> >>> Hi All,
>> >>> first mail in the group so sorry for possible inconsistency in
>> advance.
>> >>> Zookeeper version is zookeeper-3.4.6.
>> >>>
>> >>> I'm facing a problem when zookeeper is reconfiguring a cluster.
>> >>>
>> >>> Initial state:
>> >>> machine A:
>> >>> zk1(follower)
>> >>> zk2(leader)
>> >>> machine B:
>> >>> zk3(follower)
>> >>>
>> >>> zk1 and zk2 are stopped (in that order).
>> >>> zk1 is started on machine B.
>> >>> zk1 and zk3 are not creating a cluster, status is (using srvr word)
>> >>> 'This ZooKeeper instance is not currently serving requests'
>> >>>
>> >>> A part of a log is:
>> >>> [2016-10-20 04:03:10,053] WARN Unexpected exception
>> >>> (org.apache.zookeeper.server.quorum.QuorumPeer)
>> >>> java.lang.InterruptedException: Timeout while waiting for epoch from
>> >> quorum
>> >>> at org.apache.zookeeper.server.quorum.Leader.
>> >> getEpochToPropose(Leader.java:
>> >>> 878)
>> >>> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
>> >>> at org.apache.zookeeper.server.quorum.QuorumPeer.run(
>> >> QuorumPeer.java:799)
>> >>> [2016-10-20 04:03:10,054] INFO Shutting down
>> >> (org.apache.zookeeper.server.
>> >>> quorum.Leader)
>> >>> [2016-10-20 04:03:10,054] INFO Shutdown called
>> >> (org.apache.zookeeper.server.
>> >>> quorum.Leader)
>> >>> java.lang.Exception: shutdown Leader! reason: Forcing shutdown
>> >>> at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.ja
>> va:499)
>> >>> at org.apache.zookeeper.server.quorum.QuorumPeer.run(
>> >> QuorumPeer.java:805)
>> >>> [2016-10-20 04:03:10,054] INFO shutting down
>> >> (org.apache.zookeeper.server.
>> >>> ZooKeeperServer)
>> >>>
>> >>> What is interesting: when zk2(previous leader) is started on machine
B
>> >>> (instead of zk1) cluster is configured correctly.
>> >>> The same situation happens when all happen on the single machine.
>> >>>
>> >>> Issue is similar to this, but algorithm used by me is 3 (by default
>> and
>> >>> confirmed with electionAlg=3 in .cfg files)
>> >>> https://issues.apache.org/jira/browse/ZOOKEEPER-2400
>> >>>
>> >>> thanks,
>> >>> Krzysztof
>> >>
>> >>
>>
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message