Return-Path: Delivered-To: apmail-hadoop-zookeeper-user-archive@minotaur.apache.org Received: (qmail 47196 invoked from network); 16 Mar 2010 19:25:27 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 16 Mar 2010 19:25:27 -0000 Received: (qmail 38186 invoked by uid 500); 16 Mar 2010 19:25:26 -0000 Delivered-To: apmail-hadoop-zookeeper-user-archive@hadoop.apache.org Received: (qmail 38166 invoked by uid 500); 16 Mar 2010 19:25:26 -0000 Mailing-List: contact zookeeper-user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: zookeeper-user@hadoop.apache.org Delivered-To: mailing list zookeeper-user@hadoop.apache.org Received: (qmail 38158 invoked by uid 99); 16 Mar 2010 19:25:26 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 16 Mar 2010 19:25:26 +0000 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=RCVD_IN_DNSWL_NONE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of lukasz@osipiuk.net designates 209.85.220.213 as permitted sender) Received: from [209.85.220.213] (HELO mail-fx0-f213.google.com) (209.85.220.213) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 16 Mar 2010 19:25:22 +0000 Received: by fxm5 with SMTP id 5so317032fxm.29 for ; Tue, 16 Mar 2010 12:25:00 -0700 (PDT) MIME-Version: 1.0 Received: by 10.223.65.18 with SMTP id g18mr6767926fai.32.1268767499972; Tue, 16 Mar 2010 12:24:59 -0700 (PDT) In-Reply-To: <4B9FD670.7070600@apache.org> References: <3b910d891003160743k38e2e7c9y830b182d88396d55@mail.gmail.com> <4B9FA1A3.8020908@yahoo-inc.com> <3b910d891003160827x4326f05bx4f3b1d60d47ba890@mail.gmail.com> <4B9FA53C.9050804@yahoo-inc.com> <3b910d891003160854u3fe1364ek5409a8c40ac3d126@mail.gmail.com> <4B9FAF64.4000201@apache.org> <3b910d891003161003p3b9d5d7bs2959702d21452e38@mail.gmail.com> <4B9FCC6F.5070506@apache.org> <3b910d891003161151y6ba435e8vebf669caea94260e@mail.gmail.com> <4B9FD670.7070600@apache.org> Date: Tue, 16 Mar 2010 20:24:59 +0100 Message-ID: <3b910d891003161224h2ca9fd17n5a5c2f14f25d363a@mail.gmail.com> Subject: Re: permanent ZSESSIONMOVED From: =?UTF-8?Q?=C5=81ukasz_Osipiuk?= To: zookeeper-user@hadoop.apache.org Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On Tue, Mar 16, 2010 at 20:05, Patrick Hunt wrote: > We'll probably need the ZK server/client logs to hunt this down. Can you > tell if the MOVED happens in some particular scenario, say you are connec= ted > to a follower and move to a leader, or perhaps you are connected to serve= r > A, get disconnected and reconnected to server A? .... is there some patte= rn > that could help us understand what's causing this? > >> Session gets broken after zookeeper library reconnects to cluster, >> although most of reconnection leave session if working state. > > I'm not sure I follow this ^^^, could you restate it? (I get the first ab= out > about the session is broken after reconnect, what's the second part sayin= g?) Sorry for my English. I meant that it only happens from time to time. We get reconnects quite often due to large packet loss in our network, but most of the time they do not break the session. Is it clear now? Regards, =C5=81ukasz Osipiuk > Regards, > > Patrick > > =C5=81ukasz Osipiuk wrote: >> >> On Tue, Mar 16, 2010 at 19:22, Patrick Hunt wrote: >>> >>> =C5=81ukasz Osipiuk wrote: >>>> >>>> On Tue, Mar 16, 2010 at 17:18, Patrick Hunt wrote: >>>>> >>>>> Can you verify that you are using 3.2.2 on all servers? You can do th= is >>>>> by >>>>> running the "stat" command against each of your servers and look at t= he >>>>> very >>>>> top of the output (we include the version of zk server there). >>>>> http://bit.ly/dglVld >>>> >>>> Thanks for hint. >>>> >>>> We had 3.2.2 jars installed on all server - I checked it previously >>>> but apparently our script which restarts instance >>>> after installing .deb does not work very well and node2 and node3 were >>>> running version 3.2.1. >>>> Could using 3.2.1 be source of the problem? >>>> >>> I heard this story where customer service for dell asks customers to >>> replug >>> in the mouse because it might be "a dusty connection". Invariably when >>> ppl >>> look at the back of the box they find that the mouse is not actually >>> plugged >>> in. Brilliant ideas from CS perspective. ;-) >>> >>> I think that issue was fixed in 3.2.1, so I don't think that's an issue >>> (version). >> >> Damn :( >> >>>>> Are you using synchronous or async operations in your client? >>>> >>>> We are only using async operations. >>> >>> Ok. good info. >>> >>>>> I see snippets from the server logs, but I don't see anything from yo= u >>>>> client logs. It would really help if you could zip up all the logs >>>>> (server >>>>> and client) and provide them. >>>> >>>> We do not really have much logging on application side :(. Failed >>>> operations are logged with error code (ZSESSIONEXPIRED in our case). >>>> We also log session state changes but we noticed only changes between = 1 >>>> and 3. >>>> Unfortunately we do not have logs coming directly from zookeeper >>>> library as it logs to stderr/stdout and we do not store any of them at >>>> the moment. >>>> >>> I see, in this case I think that the log messages theyselves are the >>> problem, ie we are warning about this, but it's not a real problem. We >>> warn >>> on things that we don't like but can handle, we error if we >>> don'tlike/can'thandle. >>> >>> MOVED basically means that a client initiated a request on one server, >>> then >>> moved to another server before the request could be processed. >>> >>> In your case you are using async requests, that can actually make this >>> problem show up more (more "in flight" requests vs sync operation). >>> >>> So if you are not seeing MOVED on the client then this is fine. If you >>> see >>> them in the server log it means what I said above. >>> >>> That make sense, or am I missing something? >> >> It seems I was what I wrote was not clear. We are noticing errors on >> client side. Any operation after >> session gets into broken state fails with ZSESSIONMOVE error code (I >> wrote ZSESSIONEXPIRED by mistake in previous post). >> We get ZSESSIONMOVE from callback on rc parameter. >> >> Session gets broken after zookeeper library reconnects to cluster, >> although most of reconnection leave session if working state. >> >> Regards, =C5=81ukasz >> >> >>> Patrick >>> >>> >>> >>>> Regards, =C5=81ukasz >>>> >>>>> Patrick >>>>> >>>>> =C5=81ukasz Osipiuk wrote: >>>>>> >>>>>> not really - it happens occasionally - every few days :( >>>>>> I believe it is somewhat connected with our network environment whic= h >>>>>> suffers from some packet loss which leads to >>>>>> connection timeouts. >>>>>> >>>>>> I can switch on some more logging if you can lead me which categorie= s >>>>>> are worth to enable DEBUG for them. >>>>>> >>>>>> Regards, =C5=81ukasz Osipiuk >>>>>> >>>>>> On Tue, Mar 16, 2010 at 16:35, Benjamin Reed >>>>>> wrote: >>>>>>> >>>>>>> weird, this does sound like a bug. do you have a reliable way of >>>>>>> reproducing >>>>>>> the problem? >>>>>>> >>>>>>> thanx >>>>>>> ben >>>>>>> >>>>>>> On 03/16/2010 08:27 AM, =C5=81ukasz Osipiuk wrote: >>>>>>>> >>>>>>>> nope. >>>>>>>> >>>>>>>> I always pass 0 as clientid. >>>>>>>> >>>>>>>> =C5=81ukasz >>>>>>>> >>>>>>>> On Tue, Mar 16, 2010 at 16:20, Benjamin Reed >>>>>>>> =C2=A0wrote: >>>>>>>> >>>>>>>>> do you ever use zookeeper_init() with the clientid field set to >>>>>>>>> something >>>>>>>>> other than null? >>>>>>>>> >>>>>>>>> ben >>>>>>>>> >>>>>>>>> On 03/16/2010 07:43 AM, =C5=81ukasz Osipiuk wrote: >>>>>>>>> >>>>>>>>>> Hi everyone! >>>>>>>>>> >>>>>>>>>> I am writing to this group because recently we are getting some >>>>>>>>>> strange errors with our production zookeeper setup. >>>>>>>>>> >>>>>>>>>> =C2=A0From time to time we are observing that our client applica= tion >>>>>>>>>> (C++ >>>>>>>>>> based) disconnects from zookeeper (session state is changed to 1= ) >>>>>>>>>> and >>>>>>>>>> reconnects (state changed to 3). >>>>>>>>>> This itself is not a problem - usually application continues to >>>>>>>>>> run >>>>>>>>>> without problems after reconnect. >>>>>>>>>> But from time to time after above happens all subsequent >>>>>>>>>> operations >>>>>>>>>> start to return ZSESSIONMOVED error. To make it work again we ha= ve >>>>>>>>>> to >>>>>>>>>> restart application (which creates new zookeeper session). >>>>>>>>>> >>>>>>>>>> I noticed that in 3.2.0 introduced a bug >>>>>>>>>> http://issues.apache.org/jira/browse/ZOOKEEPER-449 but we are >>>>>>>>>> using >>>>>>>>>> zookeeper v. 3.2.2. >>>>>>>>>> I just noticed that app at compile time used 3.2.0 library but >>>>>>>>>> patches >>>>>>>>>> fixing bug 449 did not touch C client lib so I believe that our >>>>>>>>>> problems are not >>>>>>>>>> related with that. >>>>>>>>>> >>>>>>>>>> In zookeeper logs at moment which initiated the problem with >>>>>>>>>> client >>>>>>>>>> application I have >>>>>>>>>> >>>>>>>>>> node1: >>>>>>>>>> 2010-03-16 14:21:43,510 - INFO >>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@607] - Connected to >>>>>>>>>> /10.1.112.61:37197 lastZxid 42992576502 >>>>>>>>>> 2010-03-16 14:21:43,510 - INFO >>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@636] - Renewing session >>>>>>>>>> 0x324dcc1ba580085 >>>>>>>>>> 2010-03-16 14:21:49,443 - INFO >>>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@992] - Finished >>>>>>>>>> init >>>>>>>>>> of 0x324dcc1ba580085 valid:true >>>>>>>>>> 2010-03-16 14:21:49,443 - WARN >>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@518] - Exception causin= g >>>>>>>>>> close of session 0x324dcc1ba580085 due to java.io.IOException: >>>>>>>>>> Read >>>>>>>>>> error >>>>>>>>>> 2010-03-16 14:21:49,444 - INFO >>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@857] - closing >>>>>>>>>> session:0x324dcc1ba580085 NIOServerCnxn: >>>>>>>>>> java.nio.channels.SocketChannel[connected local=3D/10.1.112.62:2= 181 >>>>>>>>>> remote=3D/10.1.112.61:37197] >>>>>>>>>> >>>>>>>>>> node2: >>>>>>>>>> 2010-03-16 14:21:40,580 - WARN >>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@494] - Exception causin= g >>>>>>>>>> close of session 0x324dcc1ba580085 due to java.io.IOException: >>>>>>>>>> Read >>>>>>>>>> error >>>>>>>>>> 2010-03-16 14:21:40,581 - INFO >>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@833] - closing >>>>>>>>>> session:0x324dcc1ba580085 NIOServerCnxn: >>>>>>>>>> java.nio.channels.SocketChannel[connected local=3D/10.1.112.63:2= 181 >>>>>>>>>> remote=3D/10.1.112.61:60693] >>>>>>>>>> 2010-03-16 14:21:46,839 - INFO >>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@583] - Connected to >>>>>>>>>> /10.1.112.61:48336 lastZxid 42992576502 >>>>>>>>>> 2010-03-16 14:21:46,839 - INFO >>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@612] - Renewing session >>>>>>>>>> 0x324dcc1ba580085 >>>>>>>>>> 2010-03-16 14:21:49,439 - INFO >>>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@964] - Finished >>>>>>>>>> init >>>>>>>>>> of 0x324dcc1ba580085 valid:true >>>>>>>>>> >>>>>>>>>> node3: >>>>>>>>>> 2010-03-16 02:14:48,961 - WARN >>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@494] - Exception causin= g >>>>>>>>>> close of session 0x324dcc1ba580085 due to java.io.IOException: >>>>>>>>>> Read >>>>>>>>>> error >>>>>>>>>> 2010-03-16 02:14:48,962 - INFO >>>>>>>>>> [NIOServerCxn.Factory:2181:NIOServerCnxn@833] - closing >>>>>>>>>> session:0x324dcc1ba580085 NIOServerCnxn: >>>>>>>>>> java.nio.channels.SocketChannel[connected local=3D/10.1.112.64:2= 181 >>>>>>>>>> remote=3D/10.1.112.61:57309] >>>>>>>>>> >>>>>>>>>> and then lots of entries like this >>>>>>>>>> 2010-03-16 02:14:54,696 - WARN >>>>>>>>>> [ProcessThread:-1:PrepRequestProcessor@402] - Got exception when >>>>>>>>>> processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e= 49 >>>>>>>>>> zxid:0xfffffffffffffffe txntype:unknown >>>>>>>>>> /locks/9871253/lock-8589943989- >>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: >>>>>>>>>> KeeperErrorCode =3D Session moved >>>>>>>>>> =C2=A0 =C2=A0 =C2=A0at >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> org.apache.zookeeper.server.SessionTrackerImpl.checkSession(Sess= ionTrackerImpl.java:231) >>>>>>>>>> =C2=A0 =C2=A0 =C2=A0at >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRe= questProcessor.java:211) >>>>>>>>>> =C2=A0 =C2=A0 =C2=A0at >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequest= Processor.java:114) >>>>>>>>>> 2010-03-16 14:22:06,428 - WARN >>>>>>>>>> [ProcessThread:-1:PrepRequestProcessor@402] - Got exception when >>>>>>>>>> processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f66= 03 >>>>>>>>>> zxid:0xfffffffffffffffe txntype:unknown >>>>>>>>>> /locks/1665960/lock-8589961006- >>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: >>>>>>>>>> KeeperErrorCode =3D Session moved >>>>>>>>>> =C2=A0 =C2=A0 =C2=A0at >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> org.apache.zookeeper.server.SessionTrackerImpl.checkSession(Sess= ionTrackerImpl.java:231) >>>>>>>>>> =C2=A0 =C2=A0 =C2=A0at >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRe= questProcessor.java:211) >>>>>>>>>> =C2=A0 =C2=A0 =C2=A0at >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequest= Processor.java:114) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> To workaround disconnections I am going to increase session >>>>>>>>>> timeout >>>>>>>>>> from 5 to 15 seconds but event if it helps at all it is just a >>>>>>>>>> workaround. >>>>>>>>>> >>>>>>>>>> Do you have an idea where is the source of my problem. >>>>>>>>>> >>>>>>>>>> Regards, =C5=81ukasz Osipiuk >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>> >>>> >>>> >> >> >> > --=20 --=20 =C5=81ukasz Osipiuk mailto:lukasz@osipiuk.net