zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: permanent ZSESSIONMOVED
Date Thu, 18 Mar 2010 02:13:57 GMT
JIRA is great because we can move the discussion off the list and track it
explicitly along with possible fixes. Also, if you create the JIRA you'll
get updates on any changes to the record (you could 'watch' the JIRA
instead, that's fine too).

Can you include the session id in your client logs? That would be very
helpful for debugging. (It's available from the ZooKeeper class.)
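
In the Java client that is ZooKeeper.getSessionId(); from the C client (which
the C++ app sits on) it can be read via zoo_client_id(). A minimal sketch,
assuming "zh" is a connected handle returned by zookeeper_init() (the helper
name is just an example):

    #include <zookeeper/zookeeper.h>
    #include <stdio.h>

    /* Log the session id of an existing C-client handle. */
    void log_session_id(zhandle_t *zh) {
        const clientid_t *cid = zoo_client_id(zh);   /* session id + password */
        if (cid != NULL) {
            fprintf(stderr, "zookeeper session id: 0x%llx\n",
                    (unsigned long long) cid->client_id);
        }
    }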

Best if you tar/gz the logs (zip is fine too). With the full server logs
(from all servers) and the client logs for the time/incident in question we
can "follow along", if you will, pretty closely with what's happening. It
should really shed some light.

Thanks for your help on this, we'd really like to nail this one down.

Regards,

Patrick

Mahadev Konar wrote:
> Hi Lukasz,
>   I looked at the logs and haven't been able to determine how the session
> move can happen.
> 
> 
>   You should go ahead and open a JIRA for this! Please attach the logs to
> that JIRA. Mark it for 3.3.0 until we can say that it isn't a problem as of
> now.
> 
> Also, please go ahead and attach non-filtered logs to the JIRA for the same
> timeline as you did earlier (by non-filtered I mean not grepping for the
> session id). Also, do attach the earlier logs you had emailed.
> 
> Also, may I know what function you are using to print that statement?
> 
> 
> Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper:
> ZooKeeper::getChildren() path='/ggmboxlocks/16314'; callback error='session
> moved to another server, so operation is ignored'
> 
> 
> (If you can cut and paste that code into the JIRA that you create, it would
> be helpful!)...
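
For readers following along: the ZooKeeper::getChildren() in that log line
appears to be the reporter's own C++ wrapper over the C client, not a
ZooKeeper API class. A rough sketch of what the underlying async call and the
rc check probably look like at the C level (the completion and helper names
are made up for illustration):

    #include <zookeeper/zookeeper.h>
    #include <stdio.h>

    /* Completion for zoo_aget_children(); rc carries the error code that the
     * wrapper seems to be turning into text, e.g. via zerror(rc). */
    static void children_completion(int rc, const struct String_vector *strings,
                                    const void *data) {
        const char *path = (const char *) data;   /* must outlive the request */
        if (rc != ZOK) {
            /* zerror(ZSESSIONMOVED) yields "session moved to another server,
             * so operation is ignored", matching the message above. */
            fprintf(stderr, "ERROR: getChildren() path='%s'; callback error='%s'\n",
                    path, zerror(rc));
            return;
        }
        fprintf(stderr, "path '%s' has %d children\n", path, strings->count);
    }

    /* Issue the async request on a connected handle. */
    void async_get_children(zhandle_t *zh, const char *path) {
        zoo_aget_children(zh, path, 0 /* no watch */, children_completion, path);
    }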
> 
> Thanks
> mahadev
> 
> 
> On 3/17/10 2:56 AM, "Łukasz Osipiuk" <lukasz@osipiuk.net> wrote:
> 
>> Ok.
>>
>> I analysed some logs from the time when the problem last occurred.
>>
>> brokensession.txt - logs concerning the session which had the problem
>> leader.txt - logs concerning leader election (I grepped for Leader in
>> zookeeper.log)
>>
>> Some observations below:
>> - The network glitch which resulted in the problem occurred at about 22:08.
>> - From what I see, node2 was the leader since 17:48 and that did not
>> change later yesterday.
>> - The client was connected to node2 since 17:50.
>> - At around 22:09 the client tried to connect to every node (1, 2, 3).
>> Connections to node1 and node3 were closed
>>   with the exception "Exception causing close of session 0x22767e1c9630000
>> due to java.io.IOException: Read error".
>>   The connection to node2 stayed alive.
>> - All subsequent operations were refused with the ZSESSIONMOVED error. The
>> error is visible both on the client and on the server side.
>>
>> Some more clarifications concerning questions from different posts:
>>
>> 1. When I say we have huge packet drop, I mean huge packet drop for
>> a local network. We are still investigating the source of it. From time to
>> time we observe that machines lose contact with the others for a few
>> seconds.
>>
>> 2. We do not use any fancy network tricks. All machines are in a single
>> VLAN and use permanently assigned IP addresses.
>>
>> Let me know if anything more can help.
>>
>> PS. Should I already create a JIRA issue for this, or is it not concrete enough?
>>
>> Regards, Łukasz
>>
>> On Tue, Mar 16, 2010 at 20:39, Patrick Hunt <phunt@apache.org> wrote:
>>> Yea, that's great. (no complaints on my end, just wasn't sure what you
>>> meant, wanted to make sure I was clear.).
>>>
>>> Can you identify some sort of pattern? We're discussing on our end, but this
>>> is the first time we've heard of such an issue (outside 3.2.0 bug) and we're
>>> looking for a direction to hunt... anything you could add would help.
>>>
>>> Is it possible for you to upgrade your client/server to 3.2.2 and report the
>>> next failure, including any server/client logs if they are available? You
>>> should create a JIRA for this in order to track the details as they emerge.
>>> (incl attaching the logs, etc...)
>>>
>>> Patrick
>>>
>>> Łukasz Osipiuk wrote:
>>>> On Tue, Mar 16, 2010 at 20:05, Patrick Hunt <phunt@apache.org> wrote:
>>>>> We'll probably need the ZK server/client logs to hunt this down. Can you
>>>>> tell if the MOVED happens in some particular scenario, say you are
>>>>> connected to a follower and move to a leader, or perhaps you are connected
>>>>> to server A, get disconnected and reconnected to server A? .... is there
>>>>> some pattern that could help us understand what's causing this?
>>>>>
>>>>>> Session gets broken after zookeeper library reconnects to cluster,
>>>>>> although most of reconnection leave session if working state.
>>>>> I'm not sure I follow this ^^^, could you restate it? (I get the first
>>>>> part about the session being broken after reconnect, what's the second
>>>>> part saying?)
>>>> Sorry for my English.
>>>>
>>>> I meant that it only happens from time to time. We get reconnects
>>>> quite often due to large packet loss in our network, but most of the
>>>> time they do not break the session.
>>>>
>>>> Is it clear now?
>>>>
>>>> Regards, Łukasz Osipiuk
>>>>
>>>>> Regards,
>>>>>
>>>>> Patrick
>>>>>
>>>>> Łukasz Osipiuk wrote:
>>>>>> On Tue, Mar 16, 2010 at 19:22, Patrick Hunt <phunt@apache.org> wrote:
>>>>>>> Łukasz Osipiuk wrote:
>>>>>>>> On Tue, Mar 16, 2010 at 17:18, Patrick Hunt <phunt@apache.org> wrote:
>>>>>>>>> Can you verify that you are using 3.2.2 on all servers? You can do
>>>>>>>>> this by running the "stat" command against each of your servers and
>>>>>>>>> looking at the very top of the output (we include the version of the
>>>>>>>>> zk server there).
>>>>>>>>> http://bit.ly/dglVld
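
For context, the "stat" check mentioned above is one of ZooKeeper's
four-letter-word admin commands sent over the client port, e.g.
"echo stat | nc <host> 2181"; the first line of the reply reports the server
version. A rough C sketch of the same check over a plain socket, with example
host/port and minimal error handling:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Send the "stat" four-letter word to a ZooKeeper server and print the
     * reply; the first line looks like "Zookeeper version: 3.2.2-...". */
    int print_stat(const char *ip, int port) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return -1;

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons((unsigned short) port);
        inet_pton(AF_INET, ip, &addr.sin_addr);

        if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) != 0 ||
            write(fd, "stat", 4) != 4) {
            close(fd);
            return -1;
        }

        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
            buf[n] = '\0';
            fputs(buf, stdout);
        }
        close(fd);
        return 0;
    }

Calling, say, print_stat("10.1.112.62", 2181) against each server would show
which version each one is actually running.
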
>>>>>>>> Thanks for the hint.
>>>>>>>>
>>>>>>>> We had 3.2.2 jars installed on all servers - I checked it previously
>>>>>>>> but apparently our script which restarts the instance after installing
>>>>>>>> the .deb does not work very well, and node2 and node3 were running
>>>>>>>> version 3.2.1.
>>>>>>>> Could using 3.2.1 be the source of the problem?
>>>>>>>>
>>>>>>> I heard this story where customer service for Dell asks customers to
>>>>>>> replug the mouse because it might be "a dusty connection". Invariably,
>>>>>>> when people look at the back of the box they find that the mouse is not
>>>>>>> actually plugged in. Brilliant ideas from a CS perspective. ;-)
>>>>>>>
>>>>>>> I think that issue was fixed in 3.2.1, so I don't think that's the
>>>>>>> issue (the version).
>>>>>> Damn :(
>>>>>>
>>>>>>>>> Are you using synchronous or async operations in your client?
>>>>>>>> We are only using async operations.
>>>>>>> Ok. good info.
>>>>>>>
>>>>>>>>> I see snippets from the server logs, but I don't see anything from
>>>>>>>>> your client logs. It would really help if you could zip up all the
>>>>>>>>> logs (server and client) and provide them.
>>>>>>>> We do not really have much logging on the application side :(. Failed
>>>>>>>> operations are logged with the error code (ZSESSIONEXPIRED in our case).
>>>>>>>> We also log session state changes, but we noticed only changes between
>>>>>>>> 1 and 3.
>>>>>>>> Unfortunately we do not have logs coming directly from the zookeeper
>>>>>>>> library, as it logs to stderr/stdout and we do not store any of that at
>>>>>>>> the moment.
>>>>>>>>
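
The states 1 and 3 being logged line up with the C client's connection-state
constants: ZOO_CONNECTING_STATE is 1 and ZOO_CONNECTED_STATE is 3. A minimal
sketch of a global watcher that logs these transitions (the function name is
just an example):

    #include <zookeeper/zookeeper.h>
    #include <stdio.h>

    /* Global watcher passed to zookeeper_init(); logs session state changes. */
    static void session_watcher(zhandle_t *zh, int type, int state,
                                const char *path, void *ctx) {
        (void) zh; (void) path; (void) ctx;
        if (type == ZOO_SESSION_EVENT) {
            if (state == ZOO_CONNECTING_STATE) {
                fprintf(stderr, "session state -> 1 (connecting)\n");
            } else if (state == ZOO_CONNECTED_STATE) {
                fprintf(stderr, "session state -> 3 (connected)\n");
            } else if (state == ZOO_EXPIRED_SESSION_STATE) {
                fprintf(stderr, "session expired; a new handle is required\n");
            }
        }
    }
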
>>>>>>> I see, in this case I think that the log messages themselves are the
>>>>>>> problem, i.e. we are warning about this, but it's not a real problem. We
>>>>>>> warn on things that we don't like but can handle; we error if we don't
>>>>>>> like / can't handle it.
>>>>>>>
>>>>>>> MOVED basically means that a client initiated a request on one server,
>>>>>>> then moved to another server before the request could be processed.
>>>>>>>
>>>>>>> In your case you are using async requests, which can actually make this
>>>>>>> problem show up more (more "in flight" requests vs sync operation).
>>>>>>>
>>>>>>> So if you are not seeing MOVED on the client then this is fine. If you
>>>>>>> see them in the server log it means what I said above.
>>>>>>>
>>>>>>> Does that make sense, or am I missing something?
>>>>>> It seems that what I wrote was not clear. We are noticing errors on the
>>>>>> client side. Any operation after the session gets into the broken state
>>>>>> fails with the ZSESSIONMOVED error code (I wrote ZSESSIONEXPIRED by
>>>>>> mistake in the previous post).
>>>>>> We get ZSESSIONMOVED from the rc parameter of the callback.
>>>>>>
>>>>>> Session gets broken after zookeeper library reconnects to cluster,
>>>>>> although most of reconnection leave session if working state.
>>>>>>
>>>>>> Regards, Łukasz
>>>>>>
>>>>>>
>>>>>>> Patrick
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Regards, Łukasz
>>>>>>>>
>>>>>>>>> Patrick
>>>>>>>>>
>>>>>>>>> Łukasz Osipiuk wrote:
>>>>>>>>>> not really - it happens occasionally - every few days :(
>>>>>>>>>> I believe it is somewhat connected with our network environment,
>>>>>>>>>> which suffers from some packet loss which leads to connection
>>>>>>>>>> timeouts.
>>>>>>>>>>
>>>>>>>>>> I can switch on some more logging if you can tell me which categories
>>>>>>>>>> are worth enabling DEBUG for.
>>>>>>>>>>
>>>>>>>>>> Regards, Łukasz Osipiuk
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 16, 2010 at 16:35, Benjamin Reed <breed@yahoo-inc.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> weird, this does sound like a bug. do you have a reliable way of
>>>>>>>>>>> reproducing the problem?
>>>>>>>>>>>
>>>>>>>>>>> thanx
>>>>>>>>>>> ben
>>>>>>>>>>>
>>>>>>>>>>> On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:
>>>>>>>>>>>> nope.
>>>>>>>>>>>>
>>>>>>>>>>>> I always pass 0 as clientid.
>>>>>>>>>>>>
>>>>>>>>>>>> Łukasz
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 16, 2010 at 16:20, Benjamin Reed <breed@yahoo-inc.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> do you ever use zookeeper_init() with the clientid field set to
>>>>>>>>>>>>> something other than null?
>>>>>>>>>>>>>
>>>>>>>>>>>>> ben
>>>>>>>>>>>>>
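
For context: the clientid Ben is asking about is the fourth argument of
zookeeper_init(). Passing 0/NULL requests a brand-new session, while passing
a previously saved clientid_t asks the ensemble to re-attach to that session.
A minimal sketch of the "always a new session" case Łukasz describes (host
list and timeout are examples; the watcher is assumed to exist):

    #include <stddef.h>
    #include <zookeeper/zookeeper.h>

    /* New session: clientid = NULL. The third argument is the requested
     * session timeout in milliseconds. */
    zhandle_t *connect_new_session(watcher_fn watcher) {
        return zookeeper_init(
            "10.1.112.62:2181,10.1.112.63:2181,10.1.112.64:2181", /* example */
            watcher,
            15000,  /* requested session timeout, ms */
            NULL,   /* clientid: NULL/0 -> create a new session */
            NULL,   /* watcher context */
            0);     /* flags */
    }
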
>>>>>>>>>>>>> On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am writing to this group because recently we have been getting
>>>>>>>>>>>>> some strange errors with our production zookeeper setup.
>>>>>>>>>>>>>
>>>>>>>>>>>>> From time to time we are observing that our client application
>>>>>>>>>>>>> (C++ based) disconnects from zookeeper (session state is changed
>>>>>>>>>>>>> to 1) and reconnects (state changed to 3).
>>>>>>>>>>>>> This itself is not a problem - usually the application continues
>>>>>>>>>>>>> to run without problems after the reconnect.
>>>>>>>>>>>>> But from time to time, after the above happens, all subsequent
>>>>>>>>>>>>> operations start to return the ZSESSIONMOVED error. To make it
>>>>>>>>>>>>> work again we have to restart the application (which creates a
>>>>>>>>>>>>> new zookeeper session).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I noticed that 3.2.0 introduced a bug
>>>>>>>>>>>>> (http://issues.apache.org/jira/browse/ZOOKEEPER-449), but we are
>>>>>>>>>>>>> using zookeeper v. 3.2.2.
>>>>>>>>>>>>> I just noticed that the app was compiled against the 3.2.0 library,
>>>>>>>>>>>>> but the patches fixing bug 449 did not touch the C client lib, so I
>>>>>>>>>>>>> believe that our problems are not related to that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the zookeeper logs, at the moment which initiated the problem
>>>>>>>>>>>>> with the client application, I have:
>>>>>>>>>>>>>
>>>>>>>>>>>>> node1:
>>>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@607] - Connected to /10.1.112.61:37197 lastZxid 42992576502
>>>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@636] - Renewing session 0x324dcc1ba580085
>>>>>>>>>>>>> 2010-03-16 14:21:49,443 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@992] - Finished init of 0x324dcc1ba580085 valid:true
>>>>>>>>>>>>> 2010-03-16 14:21:49,443 - WARN [NIOServerCxn.Factory:2181:NIOServerCnxn@518] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
>>>>>>>>>>>>> 2010-03-16 14:21:49,444 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@857] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181 remote=/10.1.112.61:37197]
>>>>>>>>>>>>>
>>>>>>>>>>>>> node2:
>>>>>>>>>>>>> 2010-03-16 14:21:40,580 - WARN [NIOServerCxn.Factory:2181:NIOServerCnxn@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
>>>>>>>>>>>>> 2010-03-16 14:21:40,581 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181 remote=/10.1.112.61:60693]
>>>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@583] - Connected to /10.1.112.61:48336 lastZxid 42992576502
>>>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@612] - Renewing session 0x324dcc1ba580085
>>>>>>>>>>>>> 2010-03-16 14:21:49,439 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@964] - Finished init of 0x324dcc1ba580085 valid:true
>>>>>>>>>>>>>
>>>>>>>>>>>>> node3:
>>>>>>>>>>>>> 2010-03-16 02:14:48,961 - WARN [NIOServerCxn.Factory:2181:NIOServerCnxn@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
>>>>>>>>>>>>> 2010-03-16 02:14:48,962 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181 remote=/10.1.112.61:57309]
>>>>>>>>>>>>>
>>>>>>>>>>>>> and then lots of entries like this:
>>>>>>>>>>>>> 2010-03-16 02:14:54,696 - WARN [ProcessThread:-1:PrepRequestProcessor@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49 zxid:0xfffffffffffffffe txntype:unknown /locks/9871253/lock-8589943989-
>>>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
>>>>>>>>>>>>>     at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
>>>>>>>>>>>>>     at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
>>>>>>>>>>>>>     at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
>>>>>>>>>>>>> 2010-03-16 14:22:06,428 - WARN [ProcessThread:-1:PrepRequestProcessor@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f6603 zxid:0xfffffffffffffffe txntype:unknown /locks/1665960/lock-8589961006-
>>>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
>>>>>>>>>>>>>     at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
>>>>>>>>>>>>>     at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
>>>>>>>>>>>>>     at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> To work around the disconnections I am going to increase the
>>>>>>>>>>>>> session timeout from 5 to 15 seconds, but even if that helps at
>>>>>>>>>>>>> all it is just a workaround.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you have an idea where the source of my problem is?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards, Łukasz Osipiuk
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>
>>>>
>>>>
>>
> 
