Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 2DBFF200BAA for ; Thu, 27 Oct 2016 14:49:40 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 2C543160AF6; Thu, 27 Oct 2016 12:49:40 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id EE046160AE4 for ; Thu, 27 Oct 2016 14:49:38 +0200 (CEST) Received: (qmail 55852 invoked by uid 500); 27 Oct 2016 12:49:33 -0000 Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@zookeeper.apache.org Delivered-To: mailing list user@zookeeper.apache.org Received: (qmail 55841 invoked by uid 99); 27 Oct 2016 12:49:32 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 27 Oct 2016 12:49:32 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 41044C7D3F for ; Thu, 27 Oct 2016 12:49:32 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.442 X-Spam-Level: ** X-Spam-Status: No, score=2.442 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=2, HTML_OBFUSCATE_10_20=1.162, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001, WEIRD_PORT=0.001] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id 18AcYDnzWH6V for ; Thu, 27 Oct 2016 12:49:27 +0000 (UTC) Received: from mx0b-0019e102.pphosted.com (mx0b-0019e102.pphosted.com [67.231.157.237]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 4C5BC5F248 for ; Thu, 27 Oct 2016 12:49:27 +0000 (UTC) Received: from pps.filterd (m0074414.ppops.net [127.0.0.1]) by mx0b-0019e102.pphosted.com (8.16.0.17/8.16.0.17) with SMTP id u9RCfU4P011392 for ; Thu, 27 Oct 2016 07:49:25 -0500 Received: from mail-yb0-f198.google.com (mail-yb0-f198.google.com [209.85.213.198]) by mx0b-0019e102.pphosted.com with ESMTP id 26bfvc8bqt-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Thu, 27 Oct 2016 07:49:25 -0500 Received: by mail-yb0-f198.google.com with SMTP id f97so47606727ybi.7 for ; Thu, 27 Oct 2016 05:49:25 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=Auv5FeREkOcbWAyI8XZ+LifOYLGIyywvD9gyxCbsjEI=; b=C5gTp2dOCyzbv0gXLGkhOt3ol+Daem0U4oRFgl+kt3wE9khL4xiYW4tfdbN8jsoNF8 h5098o5HiXWGWr5c0dtTcnsd1FpzWNnULlLN2zWXd+1jUfJq++Jo7ci2IIIJZlP/jocW aKbJugLPFilvipK8mRqdrzl+IUk4BMauAsFaipsrpMA3HPcWB9TKtVQnnwcv3QhcuSQ3 lvXEV15FE6XCiw93reQWcg4QqW96jTh9oN2hwzKw436KwaZ5k9DgqX08jgQPf4DsJ+GK V2GdyVqQqDEIo4cBz91HTyuUMXBqUWorJv61Lnvy/cUj6w+Gl5Sx4AEEVzrgXKmDV8XR 7oXA== X-Gm-Message-State: ABUngvd/dKPx5sfw+pIni4A/uBNKDWDjqPT/k+NHLp1C3wTM4EsnyXPa2I+TMNcabs27fVHqBM3b7jr+zAsapTwc3WpsGnCqibWntob3k2fhD9d/Vnh56hYQnjM1OTVo3sKbOSqcMsIRMRAY7zjcpOkTjlsCmZQ= X-Received: by 10.36.1.79 with SMTP id 76mr9852347itk.123.1477572565159; Thu, 27 Oct 2016 05:49:25 -0700 (PDT) X-Received: by 10.36.1.79 with SMTP id 76mr9852326itk.123.1477572564715; Thu, 27 Oct 2016 05:49:24 -0700 (PDT) MIME-Version: 1.0 Received: by 10.79.34.227 with HTTP; Thu, 27 Oct 2016 05:49:23 -0700 (PDT) In-Reply-To: References: <5D093F64-7CB8-47A5-AC7A-2DF94D0C0251@apache.org> From: Krzysztof Rybak Date: Thu, 27 Oct 2016 14:49:23 +0200 Message-ID: Subject: Re: Zookeeper exception: Timeout while waiting for epoch from quorum To: user@zookeeper.apache.org Content-Type: multipart/alternative; boundary=001a113b89acd99d7f053fd82bb1 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 priorityscore=1501 malwarescore=0 suspectscore=3 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1609300000 definitions=main-1610270213 archived-at: Thu, 27 Oct 2016 12:49:40 -0000 --001a113b89acd99d7f053fd82bb1 Content-Type: text/plain; charset=UTF-8 Information I provided wasn't complete as I didn't investigate the issue well regarding connect() function timeouts. Fortunately the issue seems to be resolved with changing cnxTimeout value in configuration file. Scenario is: zk1(follower) on lo:1 zk2(leader) on lo:2 zk3(follower) on lo:3 step 1. stop zk1 ifdown lo:1 step 2. stop zk2 ifdown lo:2 step 3. ifup lo:1 start zk1 result: zk1 and zk3 cannot create a cluster of 2 zk instances. When lo:2 is not down in step 2. the cluster is created (IP address of former leader). Observation: When IP address of lo:2 is unreachable sock.connect() throws exception 'connect timed out' after a configured period of time. When lo:2 is up, exception is thrown immediately with information 'connection refused': IP is reachable, but no listener on the port. Socket timeout is set to 5000 by default (cnxTO). Solution: Decreasing cnxTO by changing cnxTimeout in configuration file. Example value is 1000, the cluster is created. Question is: is it correct solution and is such a scenario supported by zookeeper? On Wed, Oct 26, 2016 at 2:57 PM, Krzysztof Rybak wrote: > I'm not sure about the ports, but the configuration is below. Firewall is > turned off, this shouldn't be a problem. > > Configuration file for zk1: > dataDir=/opt/zookeeper/zk_server_{1,2,3} > clientPort=1400{1,2,3} > > maxClientCnxns=0 > tickTime=2000 > > initLimit=5 > syncLimit=2 > server.1=10.54.1.51:12001:13001 > server.2=10.54.1.52:12001:13001 > server.3=10.54.1.53:12001:13001 > autopurge.snapRetainCount=3 > autopurge.purgeInterval=1 > dataLogDir=/opt/zookeeper/zk_log_{1,2,3} > > {1,2,3} means different values for instances: 1 for zk1, 2 for zk2 and 3 > for zk3. > 10.54.1.{51,52,53} are IP addresses on local interfaces, can be pinged > from any machine(problem is observed on one single machine indeed). > Increasing initLimit results with no success. > Also changing election and quorum port results with fail > ( > ... > server.1=10.54.1.51:12001:13001 > server.2=10.54.1.52:12002:13002 > server.3=10.54.1.53:12003:13003 > ... > ) > > Scenario is: > >>> zk1(follower) > >>> zk2(leader) > >>> zk3(follower) > >>> > >>> step 1. stop zk1 > >>> step 2. stop zk2 > >>> step 3. start zk1 > >>> > >>> result: > >>> zk1 and zk3 cannot create a cluster of 2 zk instances > > I found in a log that follower logs another time for leader election than > new leader (log is attached): > step 3. start of zk1: > > [2016-10-20 04:03:34,627] INFO Reading configuration from: > /opt/kafka/config/zookeeper.properties.1 (org.apache.zookeeper.server.q > uorum.QuorumPeerConfig) > > after around 10 seconds(!) follower logs time of leader election > [2016-10-20 04:03:44,991] INFO FOLLOWING - LEADER ELECTION TOOK - 10255 > (org.apache.zookeeper.server.quorum.Learner) > > Then follower tries to connect to leader on quorum server communication > port(12001): > > [2016-10-20 04:03:44,996] WARN Unexpected exception, tries=0, connecting > to /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner) > java.net.ConnectException: Connection refused > > After 4 tries(0,1,2,3) follower is shutdown: > > [2016-10-20 04:03:49,002] INFO shutdown called > (org.apache.zookeeper.server.quorum.Learner) > java.lang.Exception: shutdown Follower > > Then leader seems to resolve the election: > > [2016-10-20 04:03:49,764] INFO LEADING - LEADER ELECTION TOOK - 39709 > (org.apache.zookeeper.server.quorum.Leader) > > And starts listening on server communication port(12001). > > There are many logs informing that communication to zk2 fails > (10.54.1.52). This is correct as this instance is not active. > Before starting zk1 communication to zk1 fails ([2016-10-20 > 04:03:30,070]). This is correct as this instance is not yet active. > In correct scenario information LEADER ELECTION TOOK for leader is near > the log for the follower(a part of another log): > > [2016-10-20 02:48:47,675] INFO FOLLOWING - LEADER ELECTION TOOK - 15283 > (org.apache.zookeeper.server.quorum.Learner) > [2016-10-20 02:48:47,690] INFO LEADING - LEADER ELECTION TOOK - 174954 > (org.apache.zookeeper.server.quorum.Leader) > > regards, > Krzysztof > > On Tue, Oct 25, 2016 at 8:15 PM, Flavio Junqueira wrote: > >> There are two potential reasons I can think of: >> >> 1- The configuration is incorrect and it is pointing to the wrong port. >> Keep in mind that the election port is different from the quorum server >> communication port. >> 2- There is something block traffic on the configured port between those >> two servers. >> >> Could it be any of these? >> >> -Flavio >> >> > On 25 Oct 2016, at 14:33, Krzysztof Rybak < >> krzysztof.rybak@motorolasolutions.com> wrote: >> > >> > Hi, >> > first log is a little misleading as epoch exception is observed later >> in a >> > log and may be a consequence of lack of connection. >> > The first exception observed in failing scenario is: >> > >> > [2016-10-25 07:48:41,107] WARN Unexpected exception, tries=0, >> connecting to >> > /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner) >> > java.net.ConnectException: Connection refused >> > at java.net.PlainSocketImpl.socketConnect(Native Method) >> > at >> > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSock >> etImpl.java:350) >> > at >> > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPl >> ainSocketImpl.java:206) >> > at >> > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocket >> Impl.java:188) >> > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) >> > at java.net.Socket.connect(Socket.java:589) >> > at >> > org.apache.zookeeper.server.quorum.Learner.connectToLeader(L >> earner.java:225) >> > at >> > org.apache.zookeeper.server.quorum.Follower.followLeader(Fol >> lower.java:71) >> > at >> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786) >> > >> > I think that instances are able to elect a Leader(zookeeper instance on >> > address 10.54.1.53, leader port 12001-2888 in zookeeper examples) as >> such >> > information is logged before exception: >> > >> > [2016-10-25 07:48:41,103] INFO FOLLOWING - LEADER ELECTION TOOK - 10249 >> > (org.apache.zookeeper.server.quorum.Learner) >> > Then follower tries to connect to leader but this fails. >> > >> > - If a leader is being elected but the leader isn't being able to sync >> and >> > establish itself, then you need to determine the reason. Is the follower >> > being able to connect to the leader? Is the initialization taking too >> long >> > such that it times out? Is the socket read timing out? >> > >> > This seems to be an issue that follower cannot connect to a leader on >> > leader port (12001 in my case). >> > Exception is related to connect() method: >> > >> > at >> > org.apache.zookeeper.server.quorum.Learner.connectToLeader(L >> earner.java:225) >> > >> > and the exception is >> > >> > java.net.ConnectException: Connection refused >> > >> > Follower tries to connect just after the leader stops listening on a >> leader >> > port (12001 in my case, 2888 in zookeeper examples). >> > I just don't know why follower starts trying to connect after around 10 >> > seconds after setting a new leader(during this 10 seconds leader >> listens on >> > a leader port). There is no timeout in socket connect as leader has >> already >> > stopped listening. >> > This seems to be a reason, but why this happens ? >> > >> > I assume this scenario is supported and server should be set. >> > >> >> [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0, >> connecting >> > to >> >> /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner) >> >> java.net.ConnectException: Connection refused >> >> at java.net.PlainSocketImpl.socketConnect(Native Method) >> >> at >> >> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSock >> etImpl.java: >> > 350) >> >> at >> >> java.net.AbstractPlainSocketImpl.connectToAddress( >> > AbstractPlainSocketImpl.java:206) >> >> at >> >> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocket >> Impl.java:188) >> >> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) >> >> at java.net.Socket.connect(Socket.java:589) >> >> at >> >> org.apache.zookeeper.server.quorum.Learner.connectToLeader(L >> earner.java: >> > 225) >> >> at >> >> org.apache.zookeeper.server.quorum.Follower.followLeader(Fol >> lower.java:71) >> >> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer >> .java:786) >> > >> > >> > >> > On Mon, Oct 24, 2016 at 6:57 PM, Flavio Junqueira >> wrote: >> > >> >> Hi Krzysztof, >> >> >> >> Your first message seems to indicate that a leader is being elected, >> but >> >> it is not being able to sync with enough followers (in your case 1 >> >> follower). I'm saying this because of this line: >> >> >> >>> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377) >> >> >> >> You say that the issue is timing in your second message, but it isn't >> >> entirely clear to me why you think so. I think you need to determine >> the >> >> following: >> >> >> >> - Are the two remaining servers being able to elect a leader among >> them? >> >> If not, is the problem that they can't connect to each other? >> >> - If a leader is being elected but the leader isn't being able to sync >> and >> >> establish itself, then you need to determine the reason. Is the >> follower >> >> being able to connect to the leader? Is the initialization taking too >> long >> >> such that it times out? Is the socket read timing out? >> >> >> >> -Flavio >> >> >> >> >> >>> On 24 Oct 2016, at 11:37, Krzysztof Rybak > >> motorolasolutions.com> wrote: >> >>> >> >>> Hi, >> >>> I've logged some more information and the problem seems to be a timing >> >>> issue. >> >>> Long story short, the scenario is: >> >>> zk1(follower) >> >>> zk2(leader) >> >>> zk3(follower) >> >>> >> >>> step 1. stop zk1 >> >>> step 2. stop zk2 >> >>> step 3. start zk1 >> >>> >> >>> result: >> >>> zk1 and zk3 cannot create a cluster of 2 zk instances(is such scenario >> >>> supported by zookeeper?). >> >>> When zk2 is started again instead of zk1, cluster is created. >> >>> >> >>> Observation: >> >>> After step 3., zk3 is elected to be a leader and starts listening on >> port >> >>> 2888 >> >>> ( example port from example >> >>> http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html ). >> Listening >> >>> takes around 10 seconds but during this time zk1 is not trying to >> >> connect. >> >>> zk1 tries to connect just after zk3 stops listening. >> >>> >> >>> a part of a log is: >> >>> zk3 stops listening on 04:04:17(testing with netstat) and zk1 starts >> >> trying >> >>> to connect at 04:04:19,996 as below. >> >>> >> >>> [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0, >> connecting >> >> to >> >>> /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner) >> >>> java.net.ConnectException: Connection refused >> >>> at java.net.PlainSocketImpl.socketConnect(Native Method) >> >>> at >> >>> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSock >> etImpl.java: >> >> 350) >> >>> at >> >>> java.net.AbstractPlainSocketImpl.connectToAddress( >> >> AbstractPlainSocketImpl.java:206) >> >>> at >> >>> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocket >> Impl.java: >> >> 188) >> >>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) >> >>> at java.net.Socket.connect(Socket.java:589) >> >>> at >> >>> org.apache.zookeeper.server.quorum.Learner.connectToLeader(L >> earner.java: >> >> 225) >> >>> at >> >>> org.apache.zookeeper.server.quorum.Follower.followLeader( >> >> Follower.java:71) >> >>> at org.apache.zookeeper.server.quorum.QuorumPeer.run( >> >> QuorumPeer.java:786) >> >>> >> >>> This is strange as I couldn't reproduce this on local VM VirtualBox >> but >> >> on >> >>> VMware reproduction is 100%. >> >>> >> >>> thanks, >> >>> Krzysztof >> >>> >> >>> ---------- Forwarded message ---------- >> >>> From: Krzysztof Rybak >> >>> Date: Thu, Oct 20, 2016 at 4:50 PM >> >>> Subject: Zookeeper exception: Timeout while waiting for epoch from >> quorum >> >>> To: user@zookeeper.apache.org >> >>> >> >>> >> >>> Hi All, >> >>> first mail in the group so sorry for possible inconsistency in >> advance. >> >>> Zookeeper version is zookeeper-3.4.6. >> >>> >> >>> I'm facing a problem when zookeeper is reconfiguring a cluster. >> >>> >> >>> Initial state: >> >>> machine A: >> >>> zk1(follower) >> >>> zk2(leader) >> >>> machine B: >> >>> zk3(follower) >> >>> >> >>> zk1 and zk2 are stopped (in that order). >> >>> zk1 is started on machine B. >> >>> zk1 and zk3 are not creating a cluster, status is (using srvr word) >> >>> 'This ZooKeeper instance is not currently serving requests' >> >>> >> >>> A part of a log is: >> >>> [2016-10-20 04:03:10,053] WARN Unexpected exception >> >>> (org.apache.zookeeper.server.quorum.QuorumPeer) >> >>> java.lang.InterruptedException: Timeout while waiting for epoch from >> >> quorum >> >>> at org.apache.zookeeper.server.quorum.Leader. >> >> getEpochToPropose(Leader.java: >> >>> 878) >> >>> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377) >> >>> at org.apache.zookeeper.server.quorum.QuorumPeer.run( >> >> QuorumPeer.java:799) >> >>> [2016-10-20 04:03:10,054] INFO Shutting down >> >> (org.apache.zookeeper.server. >> >>> quorum.Leader) >> >>> [2016-10-20 04:03:10,054] INFO Shutdown called >> >> (org.apache.zookeeper.server. >> >>> quorum.Leader) >> >>> java.lang.Exception: shutdown Leader! reason: Forcing shutdown >> >>> at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.ja >> va:499) >> >>> at org.apache.zookeeper.server.quorum.QuorumPeer.run( >> >> QuorumPeer.java:805) >> >>> [2016-10-20 04:03:10,054] INFO shutting down >> >> (org.apache.zookeeper.server. >> >>> ZooKeeperServer) >> >>> >> >>> What is interesting: when zk2(previous leader) is started on machine B >> >>> (instead of zk1) cluster is configured correctly. >> >>> The same situation happens when all happen on the single machine. >> >>> >> >>> Issue is similar to this, but algorithm used by me is 3 (by default >> and >> >>> confirmed with electionAlg=3 in .cfg files) >> >>> https://issues.apache.org/jira/browse/ZOOKEEPER-2400 >> >>> >> >>> thanks, >> >>> Krzysztof >> >> >> >> >> >> > --001a113b89acd99d7f053fd82bb1--