Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 0A65A200BAD for ; Tue, 25 Oct 2016 15:34:11 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 09330160AF3; Tue, 25 Oct 2016 13:34:11 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 0467F160AE6 for ; Tue, 25 Oct 2016 15:34:09 +0200 (CEST) Received: (qmail 89556 invoked by uid 500); 25 Oct 2016 13:34:03 -0000 Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@zookeeper.apache.org Delivered-To: mailing list user@zookeeper.apache.org Received: (qmail 89536 invoked by uid 99); 25 Oct 2016 13:34:03 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 25 Oct 2016 13:34:03 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 27C85C09E8 for ; Tue, 25 Oct 2016 13:34:03 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.78 X-Spam-Level: * X-Spam-Status: No, score=1.78 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=2, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RCVD_IN_SORBS_SPAM=0.5, SPF_PASS=-0.001, WEIRD_PORT=0.001] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id sg5sTyDnPHev for ; Tue, 25 Oct 2016 13:33:59 +0000 (UTC) Received: from mx0b-0019e102.pphosted.com (mx0a-0019e102.pphosted.com [67.231.149.242]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id E2F915FBCD for ; Tue, 25 Oct 2016 13:33:58 +0000 (UTC) Received: from pps.filterd (m0074409.ppops.net [127.0.0.1]) by mx0a-0019e102.pphosted.com (8.16.0.17/8.16.0.17) with SMTP id u9PDWB01005802 for ; Tue, 25 Oct 2016 08:33:47 -0500 Received: from mail-qk0-f200.google.com (mail-qk0-f200.google.com [209.85.220.200]) by mx0a-0019e102.pphosted.com with ESMTP id 26a085h8ru-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Tue, 25 Oct 2016 08:33:47 -0500 Received: by mail-qk0-f200.google.com with SMTP id i34so69650707qkh.1 for ; Tue, 25 Oct 2016 06:33:46 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=PC/a+2+P6QIWZRnFLmLke8vSPDAJcM1IpxO3oy02pw8=; b=NXiYRxur09IgQBstTo+N2VrEg2F2FnyW75Begx+BOros/PqFn+Uz9jIr/OFb7fWLJE Udyd1ABHuvh247h0uVOsa+ylliNmolYgEV78uqxE1f5Ak4Iknz9mFAo/3DAHJCf4e3/4 3aR52NpSbCxorL9D/FNT0jzhE4//siCUi44rZbQvVeQlb/RAy4TvbaXCgTJfykD+Wpit uIkwCU2hVUL2n2z2sEWilzocBfDswCeZuJ7Jn8vfEY0E3VxjYND18BuM4PoqPGeHnz/B xhBHjMl0H8NbkMEpib2oUuZX3wBK0cx6HmP6eLqCXLGuPJv+mdRRH+tabEuX6REUU9eN qfzg== X-Gm-Message-State: ABUngvfxGLAGOi5WYKWMIXqGftrB6POr9N7WJaHM18YwdXxsMrgKEunnD/50eiB8ETChWl1MnpFuclyGwRlc7gjL2gmUX7oSc+41HNiropsUYF93nIJ+f0Lp17gm3kwP7E29G7ehUKQUL/gqE0BtuBX+PrOApYg= X-Received: by 10.107.175.93 with SMTP id y90mr16359345ioe.86.1477402425474; Tue, 25 Oct 2016 06:33:45 -0700 (PDT) X-Received: by 10.107.175.93 with SMTP id y90mr16359317ioe.86.1477402425121; Tue, 25 Oct 2016 06:33:45 -0700 (PDT) MIME-Version: 1.0 Received: by 10.79.40.2 with HTTP; Tue, 25 Oct 2016 06:33:44 -0700 (PDT) In-Reply-To: References: From: Krzysztof Rybak Date: Tue, 25 Oct 2016 15:33:44 +0200 Message-ID: Subject: Re: Zookeeper exception: Timeout while waiting for epoch from quorum To: user@zookeeper.apache.org Content-Type: multipart/alternative; boundary=001a11447fcebd6e96053fb08ef7 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 priorityscore=1501 malwarescore=0 suspectscore=3 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1609300000 definitions=main-1610250227 archived-at: Tue, 25 Oct 2016 13:34:11 -0000 --001a11447fcebd6e96053fb08ef7 Content-Type: text/plain; charset=UTF-8 Hi, first log is a little misleading as epoch exception is observed later in a log and may be a consequence of lack of connection. The first exception observed in failing scenario is: [2016-10-25 07:48:41,107] WARN Unexpected exception, tries=0, connecting to /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner) java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786) I think that instances are able to elect a Leader(zookeeper instance on address 10.54.1.53, leader port 12001-2888 in zookeeper examples) as such information is logged before exception: [2016-10-25 07:48:41,103] INFO FOLLOWING - LEADER ELECTION TOOK - 10249 (org.apache.zookeeper.server.quorum.Learner) Then follower tries to connect to leader but this fails. - If a leader is being elected but the leader isn't being able to sync and establish itself, then you need to determine the reason. Is the follower being able to connect to the leader? Is the initialization taking too long such that it times out? Is the socket read timing out? This seems to be an issue that follower cannot connect to a leader on leader port (12001 in my case). Exception is related to connect() method: at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225) and the exception is java.net.ConnectException: Connection refused Follower tries to connect just after the leader stops listening on a leader port (12001 in my case, 2888 in zookeeper examples). I just don't know why follower starts trying to connect after around 10 seconds after setting a new leader(during this 10 seconds leader listens on a leader port). There is no timeout in socket connect as leader has already stopped listening. This seems to be a reason, but why this happens ? I assume this scenario is supported and server should be set. > [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0, connecting to > /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner) > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java: 350) > at > java.net.AbstractPlainSocketImpl.connectToAddress( AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at > org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java: 225) > at > org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71) > at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786) On Mon, Oct 24, 2016 at 6:57 PM, Flavio Junqueira wrote: > Hi Krzysztof, > > Your first message seems to indicate that a leader is being elected, but > it is not being able to sync with enough followers (in your case 1 > follower). I'm saying this because of this line: > > > at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377) > > You say that the issue is timing in your second message, but it isn't > entirely clear to me why you think so. I think you need to determine the > following: > > - Are the two remaining servers being able to elect a leader among them? > If not, is the problem that they can't connect to each other? > - If a leader is being elected but the leader isn't being able to sync and > establish itself, then you need to determine the reason. Is the follower > being able to connect to the leader? Is the initialization taking too long > such that it times out? Is the socket read timing out? > > -Flavio > > > > On 24 Oct 2016, at 11:37, Krzysztof Rybak motorolasolutions.com> wrote: > > > > Hi, > > I've logged some more information and the problem seems to be a timing > > issue. > > Long story short, the scenario is: > > zk1(follower) > > zk2(leader) > > zk3(follower) > > > > step 1. stop zk1 > > step 2. stop zk2 > > step 3. start zk1 > > > > result: > > zk1 and zk3 cannot create a cluster of 2 zk instances(is such scenario > > supported by zookeeper?). > > When zk2 is started again instead of zk1, cluster is created. > > > > Observation: > > After step 3., zk3 is elected to be a leader and starts listening on port > > 2888 > > ( example port from example > > http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html ). Listening > > takes around 10 seconds but during this time zk1 is not trying to > connect. > > zk1 tries to connect just after zk3 stops listening. > > > > a part of a log is: > > zk3 stops listening on 04:04:17(testing with netstat) and zk1 starts > trying > > to connect at 04:04:19,996 as below. > > > > [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0, connecting > to > > /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner) > > java.net.ConnectException: Connection refused > > at java.net.PlainSocketImpl.socketConnect(Native Method) > > at > > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java: > 350) > > at > > java.net.AbstractPlainSocketImpl.connectToAddress( > AbstractPlainSocketImpl.java:206) > > at > > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java: > 188) > > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > > at java.net.Socket.connect(Socket.java:589) > > at > > org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java: > 225) > > at > > org.apache.zookeeper.server.quorum.Follower.followLeader( > Follower.java:71) > > at org.apache.zookeeper.server.quorum.QuorumPeer.run( > QuorumPeer.java:786) > > > > This is strange as I couldn't reproduce this on local VM VirtualBox but > on > > VMware reproduction is 100%. > > > > thanks, > > Krzysztof > > > > ---------- Forwarded message ---------- > > From: Krzysztof Rybak > > Date: Thu, Oct 20, 2016 at 4:50 PM > > Subject: Zookeeper exception: Timeout while waiting for epoch from quorum > > To: user@zookeeper.apache.org > > > > > > Hi All, > > first mail in the group so sorry for possible inconsistency in advance. > > Zookeeper version is zookeeper-3.4.6. > > > > I'm facing a problem when zookeeper is reconfiguring a cluster. > > > > Initial state: > > machine A: > > zk1(follower) > > zk2(leader) > > machine B: > > zk3(follower) > > > > zk1 and zk2 are stopped (in that order). > > zk1 is started on machine B. > > zk1 and zk3 are not creating a cluster, status is (using srvr word) > > 'This ZooKeeper instance is not currently serving requests' > > > > A part of a log is: > > [2016-10-20 04:03:10,053] WARN Unexpected exception > > (org.apache.zookeeper.server.quorum.QuorumPeer) > > java.lang.InterruptedException: Timeout while waiting for epoch from > quorum > > at org.apache.zookeeper.server.quorum.Leader. > getEpochToPropose(Leader.java: > > 878) > > at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377) > > at org.apache.zookeeper.server.quorum.QuorumPeer.run( > QuorumPeer.java:799) > > [2016-10-20 04:03:10,054] INFO Shutting down > (org.apache.zookeeper.server. > > quorum.Leader) > > [2016-10-20 04:03:10,054] INFO Shutdown called > (org.apache.zookeeper.server. > > quorum.Leader) > > java.lang.Exception: shutdown Leader! reason: Forcing shutdown > > at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499) > > at org.apache.zookeeper.server.quorum.QuorumPeer.run( > QuorumPeer.java:805) > > [2016-10-20 04:03:10,054] INFO shutting down > (org.apache.zookeeper.server. > > ZooKeeperServer) > > > > What is interesting: when zk2(previous leader) is started on machine B > > (instead of zk1) cluster is configured correctly. > > The same situation happens when all happen on the single machine. > > > > Issue is similar to this, but algorithm used by me is 3 (by default and > > confirmed with electionAlg=3 in .cfg files) > > https://issues.apache.org/jira/browse/ZOOKEEPER-2400 > > > > thanks, > > Krzysztof > > --001a11447fcebd6e96053fb08ef7--