Subject: Re: ZooKeeper disconnects on controller
From: Zhen Zhang
To: "user@helix.apache.org"
Date: Sat, 2 May 2015 13:39:31 -0700
Mailing-List: contact user-help@helix.apache.org; run by ezmlm

You may also check the zookeeper log to see if there are any error/exception
messages.

On Sat, May 2, 2015 at 1:08 PM, kishore g <g.kishore@gmail.com> wrote:

> Is the zookeeper quorum working fine? Can you run echo stat | nc zkhost zkPort
> for each zk server and paste the output.
>
> On May 2, 2015 1:02 PM, "Varun Sharma" <varun@pinterest.com> wrote:
>
>> We are also seeing that all our machines (participants and controller)
>> are connecting to the same zookeeper machine, which is rather weird - it
>> also makes it hard to scale up traffic via observers. Is the following the
>> right way to pass the zookeeper string (with comma separation):
>>
>> zk001:2181, zk002:2181,zk003:2181
>>
>> Thanks
>> Varun
>>
>> On Fri, May 1, 2015 at 3:32 PM, Varun Sharma <varun@pinterest.com> wrote:
>>
>>> Hi,
>>>
>>> We are seeing zookeeper disconnects on the controller, and the controller
>>> gets into a state from which it cannot reconnect. We see messages like
>>> the ones below over and over again. It keeps trying to re-establish
>>> connections against the same session ID and keeps failing. On the other
>>> hand, the participants see one hiccup in their zookeeper connection
>>> but gracefully reconnect. What would cause the controller to keep
>>> retrying but failing to connect even after zookeeper comes back to a
>>> healthy state?
>>>
>>> 2015-05-01 20:47:02,865 [main-SendThread(terrapinzk001a:2181)] (ClientCnxn.java:1061) INFO  Opening socket connection to server terrapinzk001a/10.115.59.31:2181
>>>
>>> 2015-05-01 20:47:02,866 [main-SendThread(terrapinzk001a:2181)] (ClientCnxn.java:950) INFO  Socket connection established to terrapinzk001a/10.115.59.31:2181, initiating session
>>>
>>> 2015-05-01 20:47:02,880 [main-SendThread(terrapinzk001a:2181)] (ClientCnxn.java:739) INFO  Session establishment complete on server terrapinzk001a/10.115.59.31:2181, sessionid = 0x14d111892390023, negotiated timeout = 30000
>>>
>>> 2015-05-01 20:47:02,884 [main-EventThread] (ZkClient.java:449) INFO  zookeeper state changed (SyncConnected)
>>>
>>> 2015-05-01 20:47:02,884 [main-SendThread(terrapinzk001a:2181)] (ClientCnxn.java:1186) INFO  Unable to read additional data from server sessionid 0x14d111892390023, likely server has closed socket, closing socket connection and attempting reconnect
>>>
>>> 2015-05-01 20:47:02,988 [main-EventThread] (ZkClient.java:449) INFO  zookeeper state changed (Disconnected)
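The two checks raised in this thread (the comma-separated connection string and running stat against every zk server) could be scripted roughly as below. This is only a sketch and not part of Helix or ZooKeeper: the function names normalize_connect_string, four_letter_word, parse_stat and check_ensemble are made up for illustration, every entry is assumed to be in host:port form, and the servers are assumed to answer the stat four-letter command (newer ZooKeeper releases may require it to be whitelisted). Whether a ZooKeeper client tolerates the space after the comma in "zk001:2181, zk002:2181" may depend on the client version, so the sketch simply strips whitespace before using the string.

```python
import socket


def normalize_connect_string(raw):
    """Strip whitespace around each host:port entry and drop empty entries."""
    hosts = [part.strip() for part in raw.split(",")]
    return ",".join(h for h in hosts if h)


def four_letter_word(host, port, command="stat", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the socket after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_stat(text):
    """Collect the top-level 'Key: value' lines of a stat reply into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line or line[0].isspace():
            continue  # skip blank lines and the indented per-client lines
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value.strip()
    return fields


def check_ensemble(connect_string):
    """Print mode and connection count for each server in the connect string."""
    for entry in normalize_connect_string(connect_string).split(","):
        host, _, port = entry.rpartition(":")  # assumes host:port entries
        try:
            stats = parse_stat(four_letter_word(host, int(port)))
            print(entry, stats.get("Mode", "unknown"), stats.get("Connections", "?"))
        except OSError as exc:
            print(entry, "unreachable:", exc)
```

Calling check_ensemble("zk001:2181, zk002:2181,zk003:2181") from a machine that can reach the ensemble would print one line per server. A server that reports no Mode, or a connection count far higher than its peers, would be the first place to look, along with that server's own log.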