Return-Path: Delivered-To: apmail-hadoop-zookeeper-user-archive@minotaur.apache.org Received: (qmail 81611 invoked from network); 18 Mar 2010 01:34:02 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 18 Mar 2010 01:34:02 -0000 Received: (qmail 90642 invoked by uid 500); 18 Mar 2010 01:34:02 -0000 Delivered-To: apmail-hadoop-zookeeper-user-archive@hadoop.apache.org Received: (qmail 90625 invoked by uid 500); 18 Mar 2010 01:34:02 -0000 Mailing-List: contact zookeeper-user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: zookeeper-user@hadoop.apache.org Delivered-To: mailing list zookeeper-user@hadoop.apache.org Received: (qmail 90617 invoked by uid 99); 18 Mar 2010 01:34:02 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 18 Mar 2010 01:34:02 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=10.0 tests=AWL,MIME_QP_LONG_LINE,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (athena.apache.org: local policy) Received: from [69.147.107.21] (HELO mrout2-b.corp.re1.yahoo.com) (69.147.107.21) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 18 Mar 2010 01:33:56 +0000 Received: from SNV-EXPF01.ds.corp.yahoo.com (snv-expf01.ds.corp.yahoo.com [207.126.227.250]) by mrout2-b.corp.re1.yahoo.com (8.13.8/8.13.8/y.out) with ESMTP id o2I1WqTp015017 for ; Wed, 17 Mar 2010 18:32:52 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; s=serpent; d=yahoo-inc.com; c=nofws; q=dns; h=received:user-agent:date:subject:from:to:message-id: thread-topic:thread-index:in-reply-to:mime-version:content-type: content-transfer-encoding:x-originalarrivaltime; b=1ZHXkeWJlzN63K5jGkpuA7K1qDoHDOEXYveJIsYIfrab9ZYQI2SUiRzDbVFBYJT5 Received: from SNV-EXVS09.ds.corp.yahoo.com ([207.126.227.86]) by SNV-EXPF01.ds.corp.yahoo.com with Microsoft SMTPSVC(6.0.3790.3959); Wed, 17 Mar 2010 18:32:51 -0700 Received: from 10.73.146.106 ([10.73.146.106]) by SNV-EXVS09.ds.corp.yahoo.com ([207.126.227.84]) via Exchange Front-End Server snv-webmail.corp.yahoo.com ([207.126.227.60]) with Microsoft Exchange Server HTTP-DAV ; Thu, 18 Mar 2010 01:32:24 +0000 User-Agent: Microsoft-Entourage/12.20.0.090605 Date: Wed, 17 Mar 2010 18:32:23 -0700 Subject: Re: permanent ZSESSIONMOVED From: Mahadev Konar To: Message-ID: Thread-Topic: permanent ZSESSIONMOVED Thread-Index: AcrGOtu1+eVxZQsHMUib6XHUt2ZsRg== In-Reply-To: <3b910d891003170256v1facb2ceveef5791cf0b5ed64@mail.gmail.com> Mime-version: 1.0 Content-type: text/plain; charset="ISO-8859-2" Content-transfer-encoding: quoted-printable X-OriginalArrivalTime: 18 Mar 2010 01:32:51.0851 (UTC) FILETIME=[ECE7FDB0:01CAC63A] Hi Lukasz, I looked at the logs and havent been able to determine how the session move can happen.=20 You should go ahead and open a jira for this! Please attach the logs to that jira. Mark it for 3.3.0 until we can say that it isnt a problem as of now.=20 Also, please go ahead and attach non filtered logs to the jira for the same timeline as you had done earlier (by non filtered I mean not grepping for the session id). Also, do attach the earlier logs you had emailed. Also, may I know what function are you using to print that statement Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper: ZooKeeper::getChildren() path=3D'/ggmboxlocks/16314'; callback error=3D'session moved to another server, so operation is ignored' (If you can cut and paste that code on the jira that you create, it would b= e helpful!)...=20 Thanks mahadev On 3/17/10 2:56 AM, "=A3ukasz Osipiuk" wrote: > Ok. >=20 > I analysed some logs from time when problem last occurred. >=20 > brokensession.txt - logs concerning session which had a problem > leader.txt - logs concerning leader election (i grepped for Leader in > zookeeper.log) >=20 > Some observations below > - Network glitch which resulted in problem occurred at about 22:08. > - From what I see since 17:48 node2 was the leader and it did not > change later yesterday. > - Client was connected to node2 since 17:50 > - At around 22:09 client tried to connect to every node (1,2,3). > Connections to node1 and node3 were closed > with exception "Exception causing close of session 0x22767e1c9630000 > due to java.io.IOException: Read error". > Connection to node2 stood alive. > - All subsequent operations were refused with ZSESSIONMOVED error. > Error visible both on client and on server side. >=20 > Some more clarifications concerning questions from different posts: >=20 > 1. Saying we have huge packet drop I mean we have huge packet drop for > a local network. We are still investigating source of it. From time to > time we observe that machines loose contact with others for few > seconds. >=20 > 2. We do not use any fancy network tricks. All machines are in single > vlan and use permanently assigned IP addresses. >=20 > Let me know if anything more can help. >=20 > PS. Should I already create JIRA issue for this or is not concrete enough= ? >=20 > Regards, =A3ukasz >=20 > On Tue, Mar 16, 2010 at 20:39, Patrick Hunt wrote: >> Yea, that's great. (no complaints on my end, just wasn't sure what you >> meant, wanted to make sure I was clear.). >>=20 >> Can you identify some sort of pattern? We're discussing on our end, but = this >> is the first time we've heard of such an issue (outside 3.2.0 bug) and w= e're >> looking for a direction to hunt... anything you could add would help. >>=20 >> Is it possible for you to upgrade your client/server to 3.2.2 and report= the >> next failure, including any server/client logs if they are available? Yo= u >> should create a JIRA for this in order to track the details as they emer= ge. >> (incl attaching the logs, etc...) >>=20 >> Patrick >>=20 >> =A3ukasz Osipiuk wrote: >>>=20 >>> On Tue, Mar 16, 2010 at 20:05, Patrick Hunt wrote: >>>>=20 >>>> We'll probably need the ZK server/client logs to hunt this down. Can y= ou >>>> tell if the MOVED happens in some particular scenario, say you are >>>> connected >>>> to a follower and move to a leader, or perhaps you are connected to >>>> server >>>> A, get disconnected and reconnected to server A? .... is there some >>>> pattern >>>> that could help us understand what's causing this? >>>>=20 >>>>> Session gets broken after zookeeper library reconnects to cluster, >>>>> although most of reconnection leave session if working state. >>>>=20 >>>> I'm not sure I follow this ^^^, could you restate it? (I get the first >>>> about >>>> about the session is broken after reconnect, what's the second part >>>> saying?) >>>=20 >>> Sorry for my English. >>>=20 >>> I meant that it only happens from time to time. We get reconnects >>> quite often due to large packet loss in our network, but most of the >>> time they do not break the session. >>>=20 >>> Is it clear now? >>>=20 >>> Regards, =A3ukasz Osipiuk >>>=20 >>>> Regards, >>>>=20 >>>> Patrick >>>>=20 >>>> =A3ukasz Osipiuk wrote: >>>>>=20 >>>>> On Tue, Mar 16, 2010 at 19:22, Patrick Hunt wrote: >>>>>>=20 >>>>>> =A3ukasz Osipiuk wrote: >>>>>>>=20 >>>>>>> On Tue, Mar 16, 2010 at 17:18, Patrick Hunt wrot= e: >>>>>>>>=20 >>>>>>>> Can you verify that you are using 3.2.2 on all servers? You can do >>>>>>>> this >>>>>>>> by >>>>>>>> running the "stat" command against each of your servers and look a= t >>>>>>>> the >>>>>>>> very >>>>>>>> top of the output (we include the version of zk server there). >>>>>>>> http://bit.ly/dglVld >>>>>>>=20 >>>>>>> Thanks for hint. >>>>>>>=20 >>>>>>> We had 3.2.2 jars installed on all server - I checked it previously >>>>>>> but apparently our script which restarts instance >>>>>>> after installing .deb does not work very well and node2 and node3 w= ere >>>>>>> running version 3.2.1. >>>>>>> Could using 3.2.1 be source of the problem? >>>>>>>=20 >>>>>> I heard this story where customer service for dell asks customers to >>>>>> replug >>>>>> in the mouse because it might be "a dusty connection". Invariably wh= en >>>>>> ppl >>>>>> look at the back of the box they find that the mouse is not actually >>>>>> plugged >>>>>> in. Brilliant ideas from CS perspective. ;-) >>>>>>=20 >>>>>> I think that issue was fixed in 3.2.1, so I don't think that's an is= sue >>>>>> (version). >>>>>=20 >>>>> Damn :( >>>>>=20 >>>>>>>> Are you using synchronous or async operations in your client? >>>>>>>=20 >>>>>>> We are only using async operations. >>>>>>=20 >>>>>> Ok. good info. >>>>>>=20 >>>>>>>> I see snippets from the server logs, but I don't see anything from >>>>>>>> you >>>>>>>> client logs. It would really help if you could zip up all the logs >>>>>>>> (server >>>>>>>> and client) and provide them. >>>>>>>=20 >>>>>>> We do not really have much logging on application side :(. Failed >>>>>>> operations are logged with error code (ZSESSIONEXPIRED in our case)= . >>>>>>> We also log session state changes but we noticed only changes betwe= en >>>>>>> 1 >>>>>>> and 3. >>>>>>> Unfortunately we do not have logs coming directly from zookeeper >>>>>>> library as it logs to stderr/stdout and we do not store any of them= at >>>>>>> the moment. >>>>>>>=20 >>>>>> I see, in this case I think that the log messages theyselves are the >>>>>> problem, ie we are warning about this, but it's not a real problem. = We >>>>>> warn >>>>>> on things that we don't like but can handle, we error if we >>>>>> don'tlike/can'thandle. >>>>>>=20 >>>>>> MOVED basically means that a client initiated a request on one serve= r, >>>>>> then >>>>>> moved to another server before the request could be processed. >>>>>>=20 >>>>>> In your case you are using async requests, that can actually make th= is >>>>>> problem show up more (more "in flight" requests vs sync operation). >>>>>>=20 >>>>>> So if you are not seeing MOVED on the client then this is fine. If y= ou >>>>>> see >>>>>> them in the server log it means what I said above. >>>>>>=20 >>>>>> That make sense, or am I missing something? >>>>>=20 >>>>> It seems I was what I wrote was not clear. We are noticing errors on >>>>> client side. Any operation after >>>>> session gets into broken state fails with ZSESSIONMOVE error code (I >>>>> wrote ZSESSIONEXPIRED by mistake in previous post). >>>>> We get ZSESSIONMOVE from callback on rc parameter. >>>>>=20 >>>>> Session gets broken after zookeeper library reconnects to cluster, >>>>> although most of reconnection leave session if working state. >>>>>=20 >>>>> Regards, =A3ukasz >>>>>=20 >>>>>=20 >>>>>> Patrick >>>>>>=20 >>>>>>=20 >>>>>>=20 >>>>>>> Regards, =A3ukasz >>>>>>>=20 >>>>>>>> Patrick >>>>>>>>=20 >>>>>>>> =A3ukasz Osipiuk wrote: >>>>>>>>>=20 >>>>>>>>> not really - it happens occasionally - every few days :( >>>>>>>>> I believe it is somewhat connected with our network environment >>>>>>>>> which >>>>>>>>> suffers from some packet loss which leads to >>>>>>>>> connection timeouts. >>>>>>>>>=20 >>>>>>>>> I can switch on some more logging if you can lead me which >>>>>>>>> categories >>>>>>>>> are worth to enable DEBUG for them. >>>>>>>>>=20 >>>>>>>>> Regards, =A3ukasz Osipiuk >>>>>>>>>=20 >>>>>>>>> On Tue, Mar 16, 2010 at 16:35, Benjamin Reed >>>>>>>>> wrote: >>>>>>>>>>=20 >>>>>>>>>> weird, this does sound like a bug. do you have a reliable way of >>>>>>>>>> reproducing >>>>>>>>>> the problem? >>>>>>>>>>=20 >>>>>>>>>> thanx >>>>>>>>>> ben >>>>>>>>>>=20 >>>>>>>>>> On 03/16/2010 08:27 AM, =A3ukasz Osipiuk wrote: >>>>>>>>>>>=20 >>>>>>>>>>> nope. >>>>>>>>>>>=20 >>>>>>>>>>> I always pass 0 as clientid. >>>>>>>>>>>=20 >>>>>>>>>>> =A3ukasz >>>>>>>>>>>=20 >>>>>>>>>>> On Tue, Mar 16, 2010 at 16:20, Benjamin Reed >>>>>>>>>>> =A0wrote: >>>>>>>>>>>=20 >>>>>>>>>>>> do you ever use zookeeper_init() with the clientid field set t= o >>>>>>>>>>>> something >>>>>>>>>>>> other than null? >>>>>>>>>>>>=20 >>>>>>>>>>>> ben >>>>>>>>>>>>=20 >>>>>>>>>>>> On 03/16/2010 07:43 AM, =A3ukasz Osipiuk wrote: >>>>>>>>>>>>=20 >>>>>>>>>>>> Hi everyone! >>>>>>>>>>>>=20 >>>>>>>>>>>> I am writing to this group because recently we are getting som= e >>>>>>>>>>>> strange errors with our production zookeeper setup. >>>>>>>>>>>>=20 >>>>>>>>>>>> =A0From time to time we are observing that our client applicatio= n >>>>>>>>>>>> (C++ >>>>>>>>>>>> based) disconnects from zookeeper (session state is changed to >>>>>>>>>>>> 1) >>>>>>>>>>>> and >>>>>>>>>>>> reconnects (state changed to 3). >>>>>>>>>>>> This itself is not a problem - usually application continues t= o >>>>>>>>>>>> run >>>>>>>>>>>> without problems after reconnect. >>>>>>>>>>>> But from time to time after above happens all subsequent >>>>>>>>>>>> operations >>>>>>>>>>>> start to return ZSESSIONMOVED error. To make it work again we >>>>>>>>>>>> have >>>>>>>>>>>> to >>>>>>>>>>>> restart application (which creates new zookeeper session). >>>>>>>>>>>>=20 >>>>>>>>>>>> I noticed that in 3.2.0 introduced a bug >>>>>>>>>>>> http://issues.apache.org/jira/browse/ZOOKEEPER-449 but we are >>>>>>>>>>>> using >>>>>>>>>>>> zookeeper v. 3.2.2. >>>>>>>>>>>> I just noticed that app at compile time used 3.2.0 library but >>>>>>>>>>>> patches >>>>>>>>>>>> fixing bug 449 did not touch C client lib so I believe that ou= r >>>>>>>>>>>> problems are not >>>>>>>>>>>> related with that. >>>>>>>>>>>>=20 >>>>>>>>>>>> In zookeeper logs at moment which initiated the problem with >>>>>>>>>>>> client >>>>>>>>>>>> application I have >>>>>>>>>>>>=20 >>>>>>>>>>>> node1: >>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO >>>>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@607] - Connected to >>>>>>>>>>>> /10.1.112.61:37197 lastZxid 42992576502 >>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO >>>>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@636] - Renewing sessi= on >>>>>>>>>>>> 0x324dcc1ba580085 >>>>>>>>>>>> 2010-03-16 14:21:49,443 - INFO >>>>>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@992] - Finishe= d >>>>>>>>>>>> init >>>>>>>>>>>> of 0x324dcc1ba580085 valid:true >>>>>>>>>>>> 2010-03-16 14:21:49,443 - WARN >>>>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@518] - Exception >>>>>>>>>>>> causing >>>>>>>>>>>> close of session 0x324dcc1ba580085 due to java.io.IOException: >>>>>>>>>>>> Read >>>>>>>>>>>> error >>>>>>>>>>>> 2010-03-16 14:21:49,444 - INFO >>>>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@857] - closing >>>>>>>>>>>> session:0x324dcc1ba580085 NIOServerCnxn: >>>>>>>>>>>> java.nio.channels.SocketChannel[connected >>>>>>>>>>>> local=3D/10.1.112.62:2181 >>>>>>>>>>>> remote=3D/10.1.112.61:37197] >>>>>>>>>>>>=20 >>>>>>>>>>>> node2: >>>>>>>>>>>> 2010-03-16 14:21:40,580 - WARN >>>>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@494] - Exception >>>>>>>>>>>> causing >>>>>>>>>>>> close of session 0x324dcc1ba580085 due to java.io.IOException: >>>>>>>>>>>> Read >>>>>>>>>>>> error >>>>>>>>>>>> 2010-03-16 14:21:40,581 - INFO >>>>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@833] - closing >>>>>>>>>>>> session:0x324dcc1ba580085 NIOServerCnxn: >>>>>>>>>>>> java.nio.channels.SocketChannel[connected >>>>>>>>>>>> local=3D/10.1.112.63:2181 >>>>>>>>>>>> remote=3D/10.1.112.61:60693] >>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO >>>>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@583] - Connected to >>>>>>>>>>>> /10.1.112.61:48336 lastZxid 42992576502 >>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO >>>>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@612] - Renewing sessi= on >>>>>>>>>>>> 0x324dcc1ba580085 >>>>>>>>>>>> 2010-03-16 14:21:49,439 - INFO >>>>>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@964] - Finishe= d >>>>>>>>>>>> init >>>>>>>>>>>> of 0x324dcc1ba580085 valid:true >>>>>>>>>>>>=20 >>>>>>>>>>>> node3: >>>>>>>>>>>> 2010-03-16 02:14:48,961 - WARN >>>>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@494] - Exception >>>>>>>>>>>> causing >>>>>>>>>>>> close of session 0x324dcc1ba580085 due to java.io.IOException: >>>>>>>>>>>> Read >>>>>>>>>>>> error >>>>>>>>>>>> 2010-03-16 02:14:48,962 - INFO >>>>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@833] - closing >>>>>>>>>>>> session:0x324dcc1ba580085 NIOServerCnxn: >>>>>>>>>>>> java.nio.channels.SocketChannel[connected >>>>>>>>>>>> local=3D/10.1.112.64:2181 >>>>>>>>>>>> remote=3D/10.1.112.61:57309] >>>>>>>>>>>>=20 >>>>>>>>>>>> and then lots of entries like this >>>>>>>>>>>> 2010-03-16 02:14:54,696 - WARN >>>>>>>>>>>> [ProcessThread:-1:PrepRequestProcessor@402] - Got exception wh= en >>>>>>>>>>>> processing sessionid:0x324dcc1ba580085 type:create >>>>>>>>>>>> cxid:0x4b9e9e49 >>>>>>>>>>>> zxid:0xfffffffffffffffe txntype:unknown >>>>>>>>>>>> /locks/9871253/lock-8589943989- >>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: >>>>>>>>>>>> KeeperErrorCode =3D Session moved >>>>>>>>>>>> =A0 =A0 at >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>> org.apache.zookeeper.server.SessionTrackerImpl.checkSession(Se= ssion >>>>>>>>>>>> TrackerImpl.java:231) >>>>>>>>>>>> =A0 =A0 at >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(Prep= Reque >>>>>>>>>>>> stProcessor.java:211) >>>>>>>>>>>> =A0 =A0 at >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepReque= stPro >>>>>>>>>>>> cessor.java:114) >>>>>>>>>>>> 2010-03-16 14:22:06,428 - WARN >>>>>>>>>>>> [ProcessThread:-1:PrepRequestProcessor@402] - Got exception wh= en >>>>>>>>>>>> processing sessionid:0x324dcc1ba580085 type:create >>>>>>>>>>>> cxid:0x4b9f6603 >>>>>>>>>>>> zxid:0xfffffffffffffffe txntype:unknown >>>>>>>>>>>> /locks/1665960/lock-8589961006- >>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: >>>>>>>>>>>> KeeperErrorCode =3D Session moved >>>>>>>>>>>> =A0 =A0 at >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>> org.apache.zookeeper.server.SessionTrackerImpl.checkSession(Se= ssion >>>>>>>>>>>> TrackerImpl.java:231) >>>>>>>>>>>> =A0 =A0 at >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(Prep= Reque >>>>>>>>>>>> stProcessor.java:211) >>>>>>>>>>>> =A0 =A0 at >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepReque= stPro >>>>>>>>>>>> cessor.java:114) >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>> To workaround disconnections I am going to increase session >>>>>>>>>>>> timeout >>>>>>>>>>>> from 5 to 15 seconds but event if it helps at all it is just a >>>>>>>>>>>> workaround. >>>>>>>>>>>>=20 >>>>>>>>>>>> Do you have an idea where is the source of my problem. >>>>>>>>>>>>=20 >>>>>>>>>>>> Regards, =A3ukasz Osipiuk >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>>>>>>=20 >>>>>>>=20 >>>>>=20 >>>>>=20 >>>=20 >>>=20 >>>=20 >>=20 >=20 >=20