hadoop-zookeeper-user mailing list archives

From: Mahadev Konar <maha...@yahoo-inc.com>
Subject: Re: permanent ZSESSIONMOVED
Date: Thu, 18 Mar 2010 01:32:23 GMT
Hi Lukasz,
  I looked at the logs and haven't been able to determine how the session
move can happen.


  You should go ahead and open a jira for this! Please attach the logs to
that jira. Mark it for 3.3.0 until we can say that it isn't a problem as of
now.

Also, please go ahead and attach non filtered logs to the jira for the same
timeline as you had done earlier (by non filtered I mean not grepping for
the session id). Also, do attach the earlier logs you had emailed.

Also, may I know what function you are using to print this statement:


Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper:
ZooKeeper::getChildren() path='/ggmboxlocks/16314'; callback error='session
moved to another server, so operation is ignored'


(If you can cut and paste that code on the jira that you create, it would be
helpful!)... 

Thanks
mahadev


On 3/17/10 2:56 AM, "Łukasz Osipiuk" <lukasz@osipiuk.net> wrote:

> Ok.
> 
> I analysed some logs from the time when the problem last occurred.
> 
> brokensession.txt - logs concerning the session which had the problem
> leader.txt - logs concerning the leader election (I grepped for Leader in
> zookeeper.log)
> 
> Some observations below:
> - The network glitch which resulted in the problem occurred at about 22:08.
> - From what I can see, node2 had been the leader since 17:48 and that did
> not change later that day.
> - The client had been connected to node2 since 17:50.
> - At around 22:09 the client tried to connect to every node (1, 2, 3).
> Connections to node1 and node3 were closed
>   with the exception "Exception causing close of session 0x22767e1c9630000
> due to java.io.IOException: Read error".
>   The connection to node2 stayed alive.
> - All subsequent operations were refused with the ZSESSIONMOVED error,
> visible both on the client and the server side.
> 
> Some more clarifications concerning questions from different posts:
> 
> 1. When I say we have huge packet drop, I mean huge for a local network.
> We are still investigating its source. From time to time we observe that
> machines lose contact with the others for a few seconds.
> 
> 2. We do not use any fancy network tricks. All machines are in a single
> vlan and use permanently assigned IP addresses.
> 
> Let me know if anything more can help.
> 
> PS. Should I create a JIRA issue for this already, or is it not concrete enough?
> 
> Regards, Łukasz
> 
> On Tue, Mar 16, 2010 at 20:39, Patrick Hunt <phunt@apache.org> wrote:
>> Yea, that's great. (no complaints on my end, just wasn't sure what you
>> meant, wanted to make sure I was clear.).
>> 
>> Can you identify some sort of pattern? We're discussing on our end, but this
>> is the first time we've heard of such an issue (outside 3.2.0 bug) and we're
>> looking for a direction to hunt... anything you could add would help.
>> 
>> Is it possible for you to upgrade your client/server to 3.2.2 and report the
>> next failure, including any server/client logs if they are available? You
>> should create a JIRA for this in order to track the details as they emerge.
>> (incl attaching the logs, etc...)
>> 
>> Patrick
>> 
>> Łukasz Osipiuk wrote:
>>> 
>>> On Tue, Mar 16, 2010 at 20:05, Patrick Hunt <phunt@apache.org> wrote:
>>>> 
>>>> We'll probably need the ZK server/client logs to hunt this down. Can you
>>>> tell if the MOVED happens in some particular scenario, say you are
>>>> connected
>>>> to a follower and move to a leader, or perhaps you are connected to
>>>> server
>>>> A, get disconnected and reconnected to server A? .... is there some
>>>> pattern
>>>> that could help us understand what's causing this?
>>>> 
>>>>> Session gets broken after zookeeper library reconnects to cluster,
>>>>> although most of reconnection leave session if working state.
>>>> 
>>>> I'm not sure I follow this ^^^, could you restate it? (I get the first
>>>> part about the session being broken after reconnect, what's the second
>>>> part saying?)
>>> 
>>> Sorry for my English.
>>> 
>>> I meant that it only happens from time to time. We get reconnects
>>> quite often due to large packet loss in our network, but most of the
>>> time they do not break the session.
>>> 
>>> Is it clear now?
>>> 
>>> Regards, Łukasz Osipiuk
>>> 
>>>> Regards,
>>>> 
>>>> Patrick
>>>> 
>>>> Łukasz Osipiuk wrote:
>>>>> 
>>>>> On Tue, Mar 16, 2010 at 19:22, Patrick Hunt <phunt@apache.org> wrote:
>>>>>> 
>>>>>> Łukasz Osipiuk wrote:
>>>>>>> 
>>>>>>> On Tue, Mar 16, 2010 at 17:18, Patrick Hunt <phunt@apache.org> wrote:
>>>>>>>> 
>>>>>>>> Can you verify that you are using 3.2.2 on all servers? You can do
>>>>>>>> this by running the "stat" command against each of your servers and
>>>>>>>> look at the very top of the output (we include the version of zk
>>>>>>>> server there).
>>>>>>>> http://bit.ly/dglVld
>>>>>>> 
>>>>>>> Thanks for the hint.
>>>>>>> 
>>>>>>> We had 3.2.2 jars installed on all servers - I checked it previously
>>>>>>> but apparently our script which restarts the instance after
>>>>>>> installing the .deb does not work very well, and node2 and node3 were
>>>>>>> running version 3.2.1.
>>>>>>> Could using 3.2.1 be the source of the problem?
>>>>>>> 
>>>>>> I heard this story where customer service for Dell asks customers to
>>>>>> replug in the mouse because it might be "a dusty connection".
>>>>>> Invariably, when people look at the back of the box they find that the
>>>>>> mouse is not actually plugged in. Brilliant ideas from a CS
>>>>>> perspective. ;-)
>>>>>> 
>>>>>> I think that issue was fixed in 3.2.1, so I don't think that's an
>>>>>> issue (version).
>>>>> 
>>>>> Damn :(
>>>>> 
>>>>>>>> Are you using synchronous or async operations in your client?
>>>>>>> 
>>>>>>> We are only using async operations.
>>>>>> 
>>>>>> Ok. good info.
>>>>>> 
>>>>>>>> I see snippets from the server logs, but I don't see anything from
>>>>>>>> your client logs. It would really help if you could zip up all the
>>>>>>>> logs (server and client) and provide them.
>>>>>>> 
>>>>>>> We do not really have much logging on the application side :(.
>>>>>>> Failed operations are logged with an error code (ZSESSIONEXPIRED in
>>>>>>> our case).
>>>>>>> We also log session state changes, but we noticed only changes
>>>>>>> between 1 and 3.
>>>>>>> Unfortunately we do not have logs coming directly from the zookeeper
>>>>>>> library, as it logs to stderr/stdout and we do not store any of them
>>>>>>> at the moment.
>>>>>>> 
>>>>>> I see, in this case I think that the log messages themselves are the
>>>>>> problem, i.e. we are warning about this, but it's not a real problem.
>>>>>> We warn on things that we don't like but can handle; we error if we
>>>>>> don't like / can't handle.
>>>>>> 
>>>>>> MOVED basically means that a client initiated a request on one
>>>>>> server, then moved to another server before the request could be
>>>>>> processed.
>>>>>> 
>>>>>> In your case you are using async requests; that can actually make
>>>>>> this problem show up more (more "in flight" requests vs sync
>>>>>> operation).
>>>>>> 
>>>>>> So if you are not seeing MOVED on the client then this is fine. If
>>>>>> you see them in the server log it means what I said above.
>>>>>> 
>>>>>> Does that make sense, or am I missing something?
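The server-side check Patrick describes can be sketched as a toy model (Python for brevity; the class and method names below are illustrative stand-ins, although the real check lives in SessionTrackerImpl.checkSession, as the stack traces later in this thread show):

```python
# Toy model of server-side session ownership (illustrative only --
# not the actual ZooKeeper server code).
class SessionMovedError(Exception):
    pass

class SessionTracker:
    def __init__(self):
        self._owner = {}  # session id -> connection that currently owns it

    def reopen_session(self, session_id, conn):
        # A (re)connect makes this connection the session's owner.
        self._owner[session_id] = conn

    def check_session(self, session_id, conn):
        # A request arriving over a connection that is no longer the
        # session's owner is rejected with "session moved".
        if self._owner.get(session_id) != conn:
            raise SessionMovedError(hex(session_id))

tracker = SessionTracker()
tracker.reopen_session(0x324dcc1ba580085, "conn-to-node2")
tracker.check_session(0x324dcc1ba580085, "conn-to-node2")  # accepted

# The session reconnects elsewhere; a request still routed over the
# old connection now fails with "session moved".
tracker.reopen_session(0x324dcc1ba580085, "conn-to-node1")
try:
    tracker.check_session(0x324dcc1ba580085, "conn-to-node2")
except SessionMovedError:
    print("session moved")
```

This also illustrates why async clients hit it more often: every in-flight request queued on the old connection fails the ownership check after the move.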
>>>>> 
>>>>> It seems what I wrote was not clear. We are noticing errors on the
>>>>> client side. Any operation after the session gets into the broken
>>>>> state fails with the ZSESSIONMOVED error code (I wrote ZSESSIONEXPIRED
>>>>> by mistake in the previous post).
>>>>> We get ZSESSIONMOVED from the callback's rc parameter.
>>>>> 
>>>>> Session gets broken after zookeeper library reconnects to cluster,
>>>>> although most of reconnection leave session if working state.
>>>>> 
>>>>> Regards, Łukasz
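The client-side symptom above can be sketched the same way: a hedged Python stand-in for the C client's completion callback checking the rc parameter. The numeric value assigned to ZSESSIONMOVED and the recovery flag are assumptions for illustration, not the real libzookeeper API:

```python
# Illustrative stand-in for an async completion callback checking rc.
# The error-code values are assumed (check zookeeper.h in your build).
ZOK = 0
ZSESSIONMOVED = -118

class Client:
    def __init__(self):
        self.needs_new_session = False

    def get_children_completion(self, rc, children):
        # On ZSESSIONMOVED, every subsequent operation keeps failing;
        # the only recovery observed in this thread was to discard the
        # handle and establish a brand-new session.
        if rc == ZSESSIONMOVED:
            self.needs_new_session = True
            return []
        if rc == ZOK:
            return children
        return []

c = Client()
assert c.get_children_completion(ZOK, ["lock-1"]) == ["lock-1"]
c.get_children_completion(ZSESSIONMOVED, None)
assert c.needs_new_session  # app would now restart / re-init the session
```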
>>>>> 
>>>>> 
>>>>>> Patrick
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> Regards, Łukasz
>>>>>>> 
>>>>>>>> Patrick
>>>>>>>> 
>>>>>>>> Łukasz Osipiuk wrote:
>>>>>>>>> 
>>>>>>>>> not really - it happens occasionally - every few days :(
>>>>>>>>> I believe it is somewhat connected with our network environment,
>>>>>>>>> which suffers from some packet loss, which leads to connection
>>>>>>>>> timeouts.
>>>>>>>>> 
>>>>>>>>> I can switch on some more logging if you can point me to which
>>>>>>>>> categories are worth enabling DEBUG for.
>>>>>>>>> 
>>>>>>>>> Regards, Łukasz Osipiuk
>>>>>>>>> 
>>>>>>>>> On Tue, Mar 16, 2010 at 16:35, Benjamin Reed <breed@yahoo-inc.com>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> weird, this does sound like a bug. do you have a reliable way of
>>>>>>>>>> reproducing the problem?
>>>>>>>>>> 
>>>>>>>>>> thanx
>>>>>>>>>> ben
>>>>>>>>>> 
>>>>>>>>>> On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:
>>>>>>>>>>> 
>>>>>>>>>>> nope.
>>>>>>>>>>> 
>>>>>>>>>>> I always pass 0 as clientid.
>>>>>>>>>>> 
>>>>>>>>>>> Łukasz
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Mar 16, 2010 at 16:20, Benjamin Reed <breed@yahoo-inc.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> do you ever use zookeeper_init() with the clientid field set to
>>>>>>>>>>>> something other than null?
>>>>>>>>>>>> 
>>>>>>>>>>>> ben
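For context, a toy model of what that clientid argument controls: passing null (0) requests a brand-new session, while passing a previously returned client_id asks the cluster to resume that session. This is a sketch only; the function name below is hypothetical, and the real negotiation happens inside the C client and server:

```python
import itertools

# Toy session table: init(None) -> new session, init(old_id) -> resume.
# zookeeper_init_model is a made-up name, not the real zookeeper_init.
_next_id = itertools.count(1)
_sessions = {}

def zookeeper_init_model(client_id=None):
    if client_id and client_id in _sessions:
        return client_id            # resume the existing session
    new_id = next(_next_id)
    _sessions[new_id] = "alive"
    return new_id                   # fresh session

first = zookeeper_init_model(None)     # like passing clientid = 0
resumed = zookeeper_init_model(first)  # would try to resume that session
assert resumed == first
assert zookeeper_init_model(None) != first
```

Since Łukasz always passes 0, every reconnect that the library performs internally still renews the *same* session id; resuming a stale handle is therefore ruled out as the cause here.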
>>>>>>>>>>>> 
>>>>>>>>>>>> On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>> 
>>>>>>>>>>>> I am writing to this group because recently we have been getting
>>>>>>>>>>>> some strange errors with our production zookeeper setup.
>>>>>>>>>>>> 
>>>>>>>>>>>> From time to time we are observing that our client application
>>>>>>>>>>>> (C++ based) disconnects from zookeeper (session state is
>>>>>>>>>>>> changed to 1) and reconnects (state changed to 3).
>>>>>>>>>>>> This itself is not a problem - usually the application
>>>>>>>>>>>> continues to run without problems after the reconnect.
>>>>>>>>>>>> But from time to time, after the above happens, all subsequent
>>>>>>>>>>>> operations start to return the ZSESSIONMOVED error. To make it
>>>>>>>>>>>> work again we have to restart the application (which creates a
>>>>>>>>>>>> new zookeeper session).
>>>>>>>>>>>> 
>>>>>>>>>>>> I noticed that 3.2.0 introduced a bug
>>>>>>>>>>>> (http://issues.apache.org/jira/browse/ZOOKEEPER-449), but we
>>>>>>>>>>>> are using zookeeper v. 3.2.2.
>>>>>>>>>>>> I just noticed that the app was compiled against the 3.2.0
>>>>>>>>>>>> library, but the patches fixing bug 449 did not touch the C
>>>>>>>>>>>> client lib, so I believe that our problems are not related to
>>>>>>>>>>>> that.
>>>>>>>>>>>> 
>>>>>>>>>>>> In the zookeeper logs, at the moment which initiated the
>>>>>>>>>>>> problem with the client application, I have:
>>>>>>>>>>>> 
>>>>>>>>>>>> node1:
>>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@607] - Connected to /10.1.112.61:37197 lastZxid 42992576502
>>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@636] - Renewing session 0x324dcc1ba580085
>>>>>>>>>>>> 2010-03-16 14:21:49,443 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@992] - Finished init of 0x324dcc1ba580085 valid:true
>>>>>>>>>>>> 2010-03-16 14:21:49,443 - WARN [NIOServerCxn.Factory:2181:NIOServerCnxn@518] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
>>>>>>>>>>>> 2010-03-16 14:21:49,444 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@857] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181 remote=/10.1.112.61:37197]
>>>>>>>>>>>> 
>>>>>>>>>>>> node2:
>>>>>>>>>>>> 2010-03-16 14:21:40,580 - WARN [NIOServerCxn.Factory:2181:NIOServerCnxn@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
>>>>>>>>>>>> 2010-03-16 14:21:40,581 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181 remote=/10.1.112.61:60693]
>>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@583] - Connected to /10.1.112.61:48336 lastZxid 42992576502
>>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@612] - Renewing session 0x324dcc1ba580085
>>>>>>>>>>>> 2010-03-16 14:21:49,439 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@964] - Finished init of 0x324dcc1ba580085 valid:true
>>>>>>>>>>>> 
>>>>>>>>>>>> node3:
>>>>>>>>>>>> 2010-03-16 02:14:48,961 - WARN [NIOServerCxn.Factory:2181:NIOServerCnxn@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
>>>>>>>>>>>> 2010-03-16 02:14:48,962 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181 remote=/10.1.112.61:57309]
>>>>>>>>>>>> 
>>>>>>>>>>>> and then lots of entries like this:
>>>>>>>>>>>> 2010-03-16 02:14:54,696 - WARN [ProcessThread:-1:PrepRequestProcessor@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49 zxid:0xfffffffffffffffe txntype:unknown /locks/9871253/lock-8589943989-
>>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
>>>>>>>>>>>>     at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
>>>>>>>>>>>>     at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
>>>>>>>>>>>>     at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
>>>>>>>>>>>> 2010-03-16 14:22:06,428 - WARN [ProcessThread:-1:PrepRequestProcessor@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f6603 zxid:0xfffffffffffffffe txntype:unknown /locks/1665960/lock-8589961006-
>>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
>>>>>>>>>>>>     at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
>>>>>>>>>>>>     at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
>>>>>>>>>>>>     at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> To work around the disconnections I am going to increase the
>>>>>>>>>>>> session timeout from 5 to 15 seconds, but even if it helps at
>>>>>>>>>>>> all, it is just a workaround.
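One hedged caveat on that workaround: the server negotiates the requested session timeout within bounds derived from its tickTime (by default roughly 2x tickTime as the minimum and 20x as the maximum, if memory serves), so a client asking for 15 seconds only gets it if the server configuration allows it. An illustrative zoo.cfg fragment (values are examples, not taken from this cluster):

```
# zoo.cfg (illustrative values)
tickTime=2000
# With tickTime=2000 ms the negotiated session timeout is clamped to
# roughly 4000-40000 ms, so a client requesting 15000 ms would fit.
```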
>>>>>>>>>>>> 
>>>>>>>>>>>> Do you have an idea where the source of my problem is?
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards, Łukasz Osipiuk
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>> 
>> 
> 
> 

