Date: Tue, 12 Jan 2010 18:44:07 -0500 (EST)
Subject: Re: Killing a zookeeper server
From: "Nick Bailey"
To: zookeeper-user@hadoop.apache.org

In my last email I failed to include a log line that may be relevant as well:

2010-01-12 18:33:10,658 [QuorumPeer:/0.0.0.0:2181] (QuorumCnxManager) DEBUG - Queue size: 0
2010-01-12 18:33:10,659 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification time out: 6400

We see this line occur frequently, and the timeout gradually increases to 60000. It appears that all of our servers that seem to be acting normally are experiencing the CPU issue I mentioned earlier (https://issues.apache.org/jira/browse/ZOOKEEPER-427). Perhaps that is causing the timeout in responding?

Also, to answer your other questions, Patrick: we aren't storing a large amount of data, and network latency appears fine.

Thanks for the help,
Nick


-----Original Message-----
From: "Nick Bailey"
Sent: Tuesday, January 12, 2010 6:03pm
To: zookeeper-user@hadoop.apache.org
Subject: Re: Killing a zookeeper server

12 was just to keep uniformity on our servers. Our clients are connecting from the same 12 servers. Easily modifiable and perhaps we should look into changing that.

The logs just seem to indicate that the servers that claim to have no server running are continually attempting to elect a leader. A sample is provided below.
The initial exception is something we see regularly in our logs, and the debug and info lines following it simply repeat throughout the log.


2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) WARN - Exception causing close of session 0x0 due to java.io.IOException: Read error
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - IOException stack trace
java.io.IOException: Read error
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:295)
    at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)
2010-01-12 17:55:02,269 [NIOServerCxn.Factory:2181] (NIOServerCnxn) INFO - closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/172.20.36.9:2181 remote=/172.20.36.9:50367]
2010-01-12 17:55:02,270 [NIOServerCxn.Factory:2181] (NIOServerCnxn) DEBUG - ignoring exception during input shutdown
java.net.SocketException: Transport endpoint is not connected
    at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
    at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
    at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
    at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:767)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:421)
    at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:162)

2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 9
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,181 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 11
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,182 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:55:52,183 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
2010-01-12 17:56:52,190 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 8
2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:56:52,191 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Notification: 3, 30064826826, 1, 9, LOOKING, LOOKING, 12
2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) DEBUG - id: 3, proposed id: 3, zxid: 30064826826, proposed zxid: 30064826826
2010-01-12 17:56:52,192 [QuorumPeer:/0.0.0.0:2181] (FastLeaderElection) INFO - Adding vote
2010-01-12 17:57:52,200 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.
2010-01-12 17:57:52,201 [WorkerReceiver Thread] (FastLeaderElection$Messenger$WorkerReceiver) DEBUG - Receive new message.


-----Original Message-----
From: "Patrick Hunt"
Sent: Tuesday, January 12, 2010 5:40pm
To: zookeeper-user@hadoop.apache.org, nicholas.bailey@rackspace.com
Subject: Re: Killing a zookeeper server

12 servers? That's a lot. If you don't mind my asking, why so many?
Typically we recommend 5 - that way you can have one down for
maintenance and still have a failure that doesn't bring down the cluster.

The "electing a leader" is probably the restarted machine attempting to
re-join the ensemble (it should join as a follower if you have a leader
already elected, given that its zxid is behind the existing leader).
Hard to tell though without the logs.

You might also be seeing the initLimit exceeded. Is the data you are
storing in ZK large? Or perhaps network connectivity is slow?
http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
Again, the logs would give some insight on this.


Patrick

Nick Bailey wrote:
> We are running zookeeper 3.1.0
>
> Recently we noticed the cpu usage on our machines becoming
> increasingly high and we believe the cause is
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-427
>
> However our solution when we noticed the problem was to kill the
> zookeeper process and restart it.
>
> After doing that though it looks like the newly restarted zookeeper
> server is continually attempting to elect a leader even though one
> already exists.
>
> The process responds with 'imok' when asked, but the stat command
> returns 'ZooKeeperServer not running'.
>
> I believe that killing the current leader should trigger all servers
> to do an election and solve the problem, but I'm not sure.
> Should that be the course of action in this situation?
>
> Also we have 12 servers, but 5 are currently not running according to
> stat. So I guess this isn't a problem unless we lose another one.
> We have plans to upgrade zookeeper to solve the cpu issue but haven't
> been able to do that yet.
>
> Any help appreciated, Nick Bailey
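
The ensemble sizes discussed in this thread (12 servers with 5 reported down, and the recommended 5) can be checked with a little arithmetic. The following is only a sketch, assuming the ensemble uses a simple majority quorum; the function names are hypothetical and are not part of ZooKeeper.

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority (assumed quorum rule)."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still up."""
    return ensemble_size - majority(ensemble_size)


def has_quorum(ensemble_size: int, servers_up: int) -> bool:
    """True if the servers that are up still form a majority."""
    return servers_up >= majority(ensemble_size)


if __name__ == "__main__":
    for size in (5, 12):
        print(size, "servers: majority =", majority(size),
              "tolerated failures =", tolerated_failures(size))
    # 12 servers with 5 down leaves 7 up; one more failure leaves 6.
    print("12 servers, 7 up:", has_quorum(12, 7))
    print("12 servers, 6 up:", has_quorum(12, 6))
```

Under that assumption, a 12-server ensemble needs 7 servers up, so with 5 down there is no margin left, which matches the observation that losing one more server would be a problem. A 5-server ensemble needs 3 up and tolerates 2 failures, which is the "one down for maintenance plus one failure" case.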