Return-Path: X-Original-To: apmail-zookeeper-user-archive@www.apache.org Delivered-To: apmail-zookeeper-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E420117FEA for ; Thu, 24 Sep 2015 13:36:23 +0000 (UTC) Received: (qmail 71358 invoked by uid 500); 24 Sep 2015 13:36:23 -0000 Delivered-To: apmail-zookeeper-user-archive@zookeeper.apache.org Received: (qmail 71311 invoked by uid 500); 24 Sep 2015 13:36:23 -0000 Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@zookeeper.apache.org Delivered-To: mailing list user@zookeeper.apache.org Received: (qmail 71300 invoked by uid 99); 24 Sep 2015 13:36:23 -0000 Received: from Unknown (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 24 Sep 2015 13:36:23 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 96A80181125 for ; Thu, 24 Sep 2015 13:36:22 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -0.019 X-Spam-Level: X-Spam-Status: No, score=-0.019 tagged_above=-999 required=6.31 tests=[HEADER_FROM_DIFFERENT_DOMAINS=0.001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-us-west.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id AHQa7CoG6T9L for ; Thu, 24 Sep 2015 13:36:10 +0000 (UTC) Received: from mail-pa0-f45.google.com (mail-pa0-f45.google.com [209.85.220.45]) by mx1-us-west.apache.org (ASF Mail Server at mx1-us-west.apache.org) with ESMTPS id 2E2CD20864 for ; Thu, 24 Sep 2015 13:36:10 +0000 (UTC) Received: by pacgz1 with SMTP id gz1so6967066pac.3 for ; Thu, 24 Sep 2015 06:36:10 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=content-type:mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=ZYrhtXTmr7gqveQFTKismyHp5XrR+B9oOsF+6JOiy7I=; b=UVRIZAzu2A8qI5VCYBsSbLjjEp2BvYvUrhIsl5dvFz8G/tzoHZiDG0kSCN7nx9rped zDrNAXjqGVVfmWWvI95CMG/FNad6lpflI0qfncUzLL2P7f1taeuVQjZ3V0cdgOmfOd3h J0+mAuyWjVbNfcRs+s5kjT062O8cJ2mni9xehoWtILwCIj7jJZ9OLR5SZWiTXe3qeO2n BS/4BG6Je0QpALriYgmllJ6DneKIOpb0FWQh6GhZirEFvm8K0gSg8mxsdZUUih4pBdNH XKAt11LD9VXXft3n7mNhVHe7lyInjzP7KPfXBcQ12WIz0uDwpTNxbuBYvq5MAPynB2K0 TxrQ== X-Received: by 10.66.160.194 with SMTP id xm2mr21303105pab.111.1443101769861; Thu, 24 Sep 2015 06:36:09 -0700 (PDT) Received: from [192.168.88.84] (c-73-15-177-216.hsd1.ca.comcast.net. [73.15.177.216]) by smtp.gmail.com with ESMTPSA id j16sm13841617pbq.23.2015.09.24.06.36.09 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 24 Sep 2015 06:36:09 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2104\)) Subject: Re: Unstable work of zookeeper From: Flavio Junqueira In-Reply-To: <568AD511-F4AD-42EF-A964-DADCB0B0DAAC@icloud.com> Date: Thu, 24 Sep 2015 06:36:07 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: References: <568AD511-F4AD-42EF-A964-DADCB0B0DAAC@icloud.com> To: user@zookeeper.apache.org X-Mailer: Apple Mail (2.2104) I can see that the client is disconnecting from the server, and there is = also a new round of leader election for the zookeeper servers. If this = is happening, then yeah, your ensemble is unstable. If the ensemble = leader election is being triggered frequently, then I'd start by looking = there. Try to determine why the ensemble is failing to continue with the = same leader. If ensemble elections aren't happening frequently, then = another possibility is that GC pauses are causing the session to expire. -Flavio > On 24 Sep 2015, at 05:24, Akmal Abbasov = wrote: >=20 > Hi, > I am using zookeeper 3.4.6 > I have a spark cluster configured with HA. Once per 1-2 days, the = active spark master is shutting down with a message > 15/09/23 18:58:18 INFO zookeeper.ClientCnxn: Unable to read additional = data from server sessionid 0x34ffa68dbd10021, likely server has closed = socket, closing socket connection and attempting reconnect > 15/09/23 18:58:18 INFO state.ConnectionStateManager: State change: = SUSPENDED > 15/09/23 18:58:18 INFO master.ZooKeeperLeaderElectionAgent: We have = lost leadership > 15/09/23 18:58:18 ERROR master.Master: Leadership has been revoked -- = master shutting down. > 15/09/23 18:58:18 INFO util.Utils: Shutdown hook called >=20 > I don=E2=80=99t have the zookeeper logs from the same period, but the = logs are full of the these messages=20 > 2015-09-24 05:07:42,228 [myid:1] - INFO = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - = Accepted socket connection from /10.0.8.4:34705 > 2015-09-24 05:07:42,229 [myid:1] - WARN = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - = Connection request from old client /10.0.8.4:34705; will be dropped if = server is in r-o mode > 2015-09-24 05:07:42,229 [myid:1] - INFO = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client = attempting to establish new session at /10.0.8.4:34705 > 2015-09-24 05:07:42,292 [myid:1] - INFO = [CommitProcessor:1:ZooKeeperServer@617] - Established session = 0x14ffd3670130030 with negotiated timeout 20001 for client = /10.0.8.4:34705 > 2015-09-24 05:07:42,302 [myid:1] - WARN = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught = end of stream exception > EndOfStreamException: Unable to read additional data from client = sessionid 0x14ffd3670130030, likely client has closed socket > at = org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228) > at = org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.= java:208) > at java.lang.Thread.run(Thread.java:745) > 2015-09-24 05:07:42,303 [myid:1] - INFO = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed = socket connection for client /10.0.8.4:34705 which had sessionid = 0x14ffd3670130030 > 2015-09-24 05:07:42,314 [myid:1] - ERROR = [CommitProcessor:1:NIOServerCnxn@178] - Unexpected Exception: > java.nio.channels.CancelledKeyException > at = sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73) > at = sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77) > at = org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:15= 1) > at = org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:= 1081) > at = org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequ= estProcessor.java:404) > at = org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.jav= a:74) > 2015-09-24 05:07:42,334 [myid:1] - INFO = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - = Accepted socket connection from /10.0.8.4:34707 > 2015-09-24 05:07:42,334 [myid:1] - WARN = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - = Connection request from old client /10.0.8.4:34707; will be dropped if = server is in r-o mode > 2015-09-24 05:07:42,335 [myid:1] - INFO = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client = attempting to establish new session at /10.0.8.4:34707 > 2015-09-24 05:07:42,357 [myid:1] - INFO = [CommitProcessor:1:ZooKeeperServer@617] - Established session = 0x14ffd3670130031 with negotiated timeout 20001 for client = /10.0.8.4:34707 > 2015-09-24 05:07:42,364 [myid:1] - WARN = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught = end of stream exception > EndOfStreamException: Unable to read additional data from client = sessionid 0x14ffd3670130031, likely client has closed socket > at = org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228) > at = org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.= java:208) > at java.lang.Thread.run(Thread.java:745) > 2015-09-24 05:07:42,365 [myid:1] - INFO = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed = socket connection for client /10.0.8.4:34707 which had sessionid = 0x14ffd3670130031 > 2015-09-24 05:07:42,376 [myid:1] - ERROR = [CommitProcessor:1:NIOServerCnxn@178] - Unexpected Exception: > java.nio.channels.CancelledKeyException > at = sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73) > at = sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77) > at = org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:15= 1) > at = org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:= 1081) > at = org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequ= estProcessor.java:404) > at = org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.jav= a:74) >=20 > Also there are=20 > 2015-09-24 06:29:54,459 [myid:1] - INFO = [QuorumPeer[myid=3D1]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@139] = - Shutting down > 2015-09-24 06:29:54,459 [myid:1] - INFO = [QuorumPeer[myid=3D1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@441] - = shutting down > 2015-09-24 06:29:54,459 [myid:1] - INFO = [QuorumPeer[myid=3D1]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@105] = - Shutting down > 2015-09-24 06:29:54,459 [myid:1] - INFO = [FollowerRequestProcessor:1:FollowerRequestProcessor@95] - = FollowerRequestProcessor exited loop! > 2015-09-24 06:29:54,460 [myid:1] - INFO = [QuorumPeer[myid=3D1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@181] - = Shutting down > 2015-09-24 06:29:54,464 [myid:1] - INFO = [CommitProcessor:1:CommitProcessor@150] - CommitProcessor exited loop! > 2015-09-24 06:29:54,465 [myid:1] - INFO = [QuorumPeer[myid=3D1]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@415] - = shutdown of request processor complete > 2015-09-24 06:29:54,466 [myid:1] - INFO = [QuorumPeer[myid=3D1]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@209] - = Shutting down > 2015-09-24 06:29:54,466 [myid:1] - INFO = [SyncThread:1:SyncRequestProcessor@187] - SyncRequestProcessor exited! > 2015-09-24 06:29:54,466 [myid:1] - INFO = [QuorumPeer[myid=3D1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@714] - LOOKING > 2015-09-24 06:29:54,584 [myid:1] - INFO = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - = Accepted socket connection from /10.0.8.58:36137 > 2015-09-24 06:29:54,584 [myid:1] - WARN = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - = Exception causing close of session 0x0 due to java.io.IOException: = ZooKeeperServer not running > 2015-09-24 06:29:54,584 [myid:1] - INFO = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed = socket connection for client /10.0.8.58:36137 (no session established = for client) > 2015-09-24 06:29:54,679 [myid:1] - INFO = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - = Accepted socket connection from /10.0.8.57:57410 > 2015-09-24 06:29:54,680 [myid:1] - WARN = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - = Exception causing close of session 0x0 due to java.io.IOException: = ZooKeeperServer not running > 2015-09-24 06:29:54,680 [myid:1] - INFO = [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed = socket connection for client /10.0.8.57:57410 (no session established = for client) >=20 > I also observed that hadoop-zkfc restarts very frequently. > Any ideas what could be wrong? >=20 > Thanks.