hadoop-zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: permanent ZSESSIONMOVED
Date Thu, 18 Mar 2010 20:23:45 GMT
We've identified the problem and are working on a fix. Full details here:

https://issues.apache.org/jira/browse/ZOOKEEPER-710

Patrick

Łukasz Osipiuk wrote:
> Hello Mahadev,
> 
> 2010/3/18 Mahadev Konar <mahadev@yahoo-inc.com>:
>> Hi Lukasz,
>>  I looked at the logs and haven't been able to determine how the session
>> move can happen.
>>
>>
>>  You should go ahead and open a jira for this! Please attach the logs to
>> that jira. Mark it for 3.3.0 until we can say that it isn't a problem as of
>> now.
> 
> I created an issue for this:
> https://issues.apache.org/jira/browse/ZOOKEEPER-710. I marked it as
> 3.2.2 as this is the version we are using.
> 
>> Also, please go ahead and attach non filtered logs to the jira for the same
>> timeline as you had done earlier (by non filtered I mean not grepping for
>> the session id). Also, do attach the earlier logs you had emailed.
> 
> I attached the logs to the jira.
> 
>> Also, may I know what function you are using to print that statement:
>>
>>
>> Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper:
>> ZooKeeper::getChildren() path='/ggmboxlocks/16314'; callback error='session
>> moved to another server, so operation is ignored'
> 
> info added to jira
> 
>> (If you can cut and paste that code on the jira that you create, it would be
>> helpful!)...
>>
>> Thanks
>> mahadev
>>
>>
>> On 3/17/10 2:56 AM, "Łukasz Osipiuk" <lukasz@osipiuk.net> wrote:
>>
>>> Ok.
>>>
>>> I analysed some logs from the time when the problem last occurred.
>>>
>>> brokensession.txt - logs concerning session which had a problem
>>> leader.txt - logs concerning leader election (I grepped for Leader in
>>> zookeeper.log)
>>>
>>> Some observations below:
>>> - The network glitch which caused the problem occurred at about 22:08.
>>> - From what I see, node2 had been the leader since 17:48 and the leader
>>> did not change later that day.
>>> - The client had been connected to node2 since 17:50.
>>> - At around 22:09 the client tried to connect to every node (1, 2, 3).
>>> Connections to node1 and node3 were closed
>>>   with the exception "Exception causing close of session 0x22767e1c9630000
>>> due to java.io.IOException: Read error".
>>>   The connection to node2 stayed alive.
>>> - All subsequent operations were refused with the ZSESSIONMOVED error,
>>> visible both on the client and on the server side.
>>>
>>> Some more clarifications concerning questions from different posts:
>>>
>>> 1. By saying we have huge packet drop, I mean huge for a local network.
>>> We are still investigating its source. From time to time we observe that
>>> machines lose contact with the others for a few seconds.
>>>
>>> 2. We do not use any fancy network tricks. All machines are in a single
>>> VLAN and use permanently assigned IP addresses.
>>>
>>> Let me know if there is anything else that would help.
>>>
>>> PS. Should I create a JIRA issue for this already, or is it not concrete enough yet?
>>>
>>> Regards, Łukasz
>>>
>>> On Tue, Mar 16, 2010 at 20:39, Patrick Hunt <phunt@apache.org> wrote:
>>>> Yea, that's great. (no complaints on my end, just wasn't sure what you
>>>> meant, wanted to make sure I was clear.).
>>>>
>>>> Can you identify some sort of pattern? We're discussing on our end, but this
>>>> is the first time we've heard of such an issue (outside 3.2.0 bug) and we're
>>>> looking for a direction to hunt... anything you could add would help.
>>>>
>>>> Is it possible for you to upgrade your client/server to 3.2.2 and report the
>>>> next failure, including any server/client logs if they are available? You
>>>> should create a JIRA for this in order to track the details as they emerge.
>>>> (incl attaching the logs, etc...)
>>>>
>>>> Patrick
>>>>
>>>> Łukasz Osipiuk wrote:
>>>>> On Tue, Mar 16, 2010 at 20:05, Patrick Hunt <phunt@apache.org> wrote:
>>>>>> We'll probably need the ZK server/client logs to hunt this down. Can you
>>>>>> tell if the MOVED happens in some particular scenario, say you are connected
>>>>>> to a follower and move to a leader, or perhaps you are connected to server
>>>>>> A, get disconnected and reconnected to server A? .... is there some pattern
>>>>>> that could help us understand what's causing this?
>>>>>>
>>>>>>> Session gets broken after the zookeeper library reconnects to the cluster,
>>>>>>> although most reconnections leave the session in a working state.
>>>>>> I'm not sure I follow this ^^^, could you restate it? (I get the first part
>>>>>> about the session being broken after reconnect; what's the second part
>>>>>> saying?)
>>>>> Sorry for my English.
>>>>>
>>>>> I meant that it only happens from time to time. We get reconnects
>>>>> quite often due to large packet loss in our network, but most of the
>>>>> time they do not break the session.
>>>>>
>>>>> Is it clear now?
>>>>>
>>>>> Regards, Łukasz Osipiuk
>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>> Łukasz Osipiuk wrote:
>>>>>>> On Tue, Mar 16, 2010 at 19:22, Patrick Hunt <phunt@apache.org> wrote:
>>>>>>>> Łukasz Osipiuk wrote:
>>>>>>>>> On Tue, Mar 16, 2010 at 17:18, Patrick Hunt <phunt@apache.org> wrote:
>>>>>>>>>> Can you verify that you are using 3.2.2 on all servers? You can do this
>>>>>>>>>> by running the "stat" command against each of your servers and looking at
>>>>>>>>>> the very top of the output (we include the version of the zk server there).
>>>>>>>>>> http://bit.ly/dglVld
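For readers of the archive: the "stat" check above can be scripted against each ensemble member over a plain TCP connection, since "stat" is one of ZooKeeper's four-letter admin commands. A minimal sketch, assuming the hosts are named node1..node3 and listen on the default client port 2181 (placeholders for this thread's actual hosts):

```shell
# Ask each ensemble member for its status; the first line of the reply
# reports the server version (e.g. "Zookeeper version: 3.2.2-...").
for host in node1 node2 node3; do
    printf '%s: ' "$host"
    echo stat | nc "$host" 2181 | head -1
done
```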
>>>>>>>>> Thanks for the hint.
>>>>>>>>>
>>>>>>>>> We had 3.2.2 jars installed on all servers - I checked it previously,
>>>>>>>>> but apparently our script which restarts the instance
>>>>>>>>> after installing the .deb does not work very well, and node2 and node3 were
>>>>>>>>> running version 3.2.1.
>>>>>>>>> Could using 3.2.1 be the source of the problem?
>>>>>>>>>
>>>>>>>> I heard this story where customer service for Dell asks customers to replug
>>>>>>>> the mouse because it might be "a dusty connection". Invariably, when ppl
>>>>>>>> look at the back of the box they find that the mouse is not actually plugged
>>>>>>>> in. Brilliant idea from a CS perspective. ;-)
>>>>>>>>
>>>>>>>> I think that issue was fixed in 3.2.1, so I don't think the version is
>>>>>>>> the problem here.
>>>>>>> Damn :(
>>>>>>>
>>>>>>>>>> Are you using synchronous or async operations in your client?
>>>>>>>>> We are only using async operations.
>>>>>>>> Ok. good info.
>>>>>>>>
>>>>>>>>>> I see snippets from the server logs, but I don't see anything from your
>>>>>>>>>> client logs. It would really help if you could zip up all the logs (server
>>>>>>>>>> and client) and provide them.
>>>>>>>>> We do not really have much logging on the application side :(. Failed
>>>>>>>>> operations are logged with an error code (ZSESSIONEXPIRED in our case).
>>>>>>>>> We also log session state changes, but we noticed only changes between 1
>>>>>>>>> and 3.
>>>>>>>>> Unfortunately we do not have logs coming directly from the zookeeper
>>>>>>>>> library, as it logs to stderr/stdout and we do not store any of it at
>>>>>>>>> the moment.
>>>>>>>>>
>>>>>>>> I see. In this case I think that the log messages themselves are the
>>>>>>>> problem, ie we are warning about this, but it's not a real problem. We warn
>>>>>>>> on things that we don't like but can handle; we error if we
>>>>>>>> don't like / can't handle.
>>>>>>>>
>>>>>>>> MOVED basically means that a client initiated a request on one server, then
>>>>>>>> moved to another server before the request could be processed.
>>>>>>>>
>>>>>>>> In your case you are using async requests; that can actually make this
>>>>>>>> problem show up more (more "in flight" requests vs sync operation).
>>>>>>>>
>>>>>>>> So if you are not seeing MOVED on the client then this is fine. If you see
>>>>>>>> them in the server log it means what I said above.
>>>>>>>>
>>>>>>>> Does that make sense, or am I missing something?
>>>>>>> It seems what I wrote was not clear. We are noticing errors on the
>>>>>>> client side. Any operation after the
>>>>>>> session gets into the broken state fails with the ZSESSIONMOVED error code (I
>>>>>>> wrote ZSESSIONEXPIRED by mistake in the previous post).
>>>>>>> We get ZSESSIONMOVED from the callback's rc parameter.
>>>>>>>
>>>>>>> Session gets broken after the zookeeper library reconnects to the cluster,
>>>>>>> although most reconnections leave the session in a working state.
>>>>>>>
>>>>>>> Regards, Łukasz
>>>>>>>
>>>>>>>
>>>>>>>> Patrick
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Regards, Łukasz
>>>>>>>>>
>>>>>>>>>> Patrick
>>>>>>>>>>
>>>>>>>>>> Łukasz Osipiuk wrote:
>>>>>>>>>>> not really - it happens occasionally - every few days :(
>>>>>>>>>>> I believe it is somewhat connected with our network environment, which
>>>>>>>>>>> suffers from some packet loss which leads to connection timeouts.
>>>>>>>>>>>
>>>>>>>>>>> I can switch on some more logging if you can tell me which categories
>>>>>>>>>>> are worth enabling DEBUG for.
>>>>>>>>>>>
>>>>>>>>>>> Regards, Łukasz Osipiuk
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 16, 2010 at 16:35, Benjamin Reed <breed@yahoo-inc.com> wrote:
>>>>>>>>>>>> weird, this does sound like a bug. do you have a reliable way of
>>>>>>>>>>>> reproducing the problem?
>>>>>>>>>>>>
>>>>>>>>>>>> thanx
>>>>>>>>>>>> ben
>>>>>>>>>>>>
>>>>>>>>>>>> On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:
>>>>>>>>>>>>> nope.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I always pass 0 as clientid.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Łukasz
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Mar 16, 2010 at 16:20, Benjamin Reed <breed@yahoo-inc.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> do you ever use zookeeper_init() with the clientid field set to
>>>>>>>>>>>>>> something other than null?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ben
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am writing to this group because recently we are getting some
>>>>>>>>>>>>>> strange errors with our production zookeeper setup.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  From time to time we are observing that our client application (C++
>>>>>>>>>>>>>> based) disconnects from zookeeper (session state is changed to 1) and
>>>>>>>>>>>>>> reconnects (state changed to 3).
>>>>>>>>>>>>>> This itself is not a problem - usually the application continues to run
>>>>>>>>>>>>>> without problems after the reconnect.
>>>>>>>>>>>>>> But from time to time, after the above happens, all subsequent operations
>>>>>>>>>>>>>> start to return the ZSESSIONMOVED error. To make it work again we have to
>>>>>>>>>>>>>> restart the application (which creates a new zookeeper session).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I noticed that 3.2.0 introduced a bug,
>>>>>>>>>>>>>> http://issues.apache.org/jira/browse/ZOOKEEPER-449, but we are using
>>>>>>>>>>>>>> zookeeper v. 3.2.2.
>>>>>>>>>>>>>> I just noticed that the app at compile time used the 3.2.0 library, but
>>>>>>>>>>>>>> the patches fixing bug 449 did not touch the C client lib, so I believe
>>>>>>>>>>>>>> that our problems are not related to that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the zookeeper logs, at the moment which initiated the problem with
>>>>>>>>>>>>>> the client application, I have
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> node1:
>>>>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@607] - Connected to /10.1.112.61:37197 lastZxid 42992576502
>>>>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@636] - Renewing session 0x324dcc1ba580085
>>>>>>>>>>>>>> 2010-03-16 14:21:49,443 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@992] - Finished init of 0x324dcc1ba580085 valid:true
>>>>>>>>>>>>>> 2010-03-16 14:21:49,443 - WARN [NIOServerCxn.Factory:2181:NIOServerCnxn@518] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
>>>>>>>>>>>>>> 2010-03-16 14:21:49,444 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@857] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181 remote=/10.1.112.61:37197]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> node2:
>>>>>>>>>>>>>> 2010-03-16 14:21:40,580 - WARN [NIOServerCxn.Factory:2181:NIOServerCnxn@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
>>>>>>>>>>>>>> 2010-03-16 14:21:40,581 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181 remote=/10.1.112.61:60693]
>>>>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@583] - Connected to /10.1.112.61:48336 lastZxid 42992576502
>>>>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@612] - Renewing session 0x324dcc1ba580085
>>>>>>>>>>>>>> 2010-03-16 14:21:49,439 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@964] - Finished init of 0x324dcc1ba580085 valid:true
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> node3:
>>>>>>>>>>>>>> 2010-03-16 02:14:48,961 - WARN [NIOServerCxn.Factory:2181:NIOServerCnxn@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
>>>>>>>>>>>>>> 2010-03-16 02:14:48,962 - INFO [NIOServerCxn.Factory:2181:NIOServerCnxn@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181 remote=/10.1.112.61:57309]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and then lots of entries like this:
>>>>>>>>>>>>>> 2010-03-16 02:14:54,696 - WARN [ProcessThread:-1:PrepRequestProcessor@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49 zxid:0xfffffffffffffffe txntype:unknown /locks/9871253/lock-8589943989-
>>>>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
>>>>>>>>>>>>>> 2010-03-16 14:22:06,428 - WARN [ProcessThread:-1:PrepRequestProcessor@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f6603 zxid:0xfffffffffffffffe txntype:unknown /locks/1665960/lock-8589961006-
>>>>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
>>>>>>>>>>>>>>     at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To work around the disconnections I am going to increase the session
>>>>>>>>>>>>>> timeout from 5 to 15 seconds, but even if it helps at all, it is just a
>>>>>>>>>>>>>> workaround.
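For readers of the archive: in the C client the session timeout mentioned above is the recv_timeout argument to zookeeper_init(), given in milliseconds. A hedged sketch of the change (hostnames and the watcher are placeholders; it is not standalone, it builds only against the ZooKeeper C client headers/library):

```c
/* Sketch only: requires zookeeper.h / libzookeeper to build. */
#include <stddef.h>
#include <zookeeper.h>

static void my_watcher(zhandle_t *zh, int type, int state,
                       const char *path, void *ctx)
{
    /* react to connection/session state changes here */
    (void)zh; (void)type; (void)state; (void)path; (void)ctx;
}

zhandle_t *connect_with_longer_timeout(void)
{
    return zookeeper_init(
        "node1:2181,node2:2181,node3:2181", /* ensemble (placeholder hosts) */
        my_watcher,                         /* global watcher */
        15000,                              /* requested session timeout, ms */
        NULL,                               /* clientid: NULL = new session */
        NULL,                               /* watcher context */
        0);                                 /* flags */
}
```

Note the requested value is only a hint: the server bounds the negotiated session timeout by its tickTime-derived limits (by default between 2x and 20x tickTime), so the effective timeout may differ.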
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you have an idea where the source of my problem is?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards, Łukasz Osipiuk
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>
>>
> 
> 
> 
