zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick Hunt <ph...@apache.org>
Subject Re: ephemeral node not deleted after client session closed
Date Thu, 10 Nov 2011 20:20:02 GMT
I  believe I see the issue.

PrepRequestProcessor checks the session is valid on a create.

However it doesn't not know about an in-progress close session request

The session is not invalidated (session tracker) until
FinalRequestProcessor. At which point a create in PrepRP would fail.
However prior to that it will get into the pipeline. Resulting in the
issue you are seeing.


Q: what are you clients doing? It's weird that a create would come
from the client after the session has been closed. Do you have
multiple threads sharing a session?


Patrick

On Thu, Nov 10, 2011 at 11:54 AM, Patrick Hunt <phunt@apache.org> wrote:
> ps. was there any instability in the quorum itself during this time
> period? Attaching logs to 1208 would be helpful.
>
> Patrick
>
> On Thu, Nov 10, 2011 at 11:50 AM, Patrick Hunt <phunt@apache.org> wrote:
>> Great report, some addl questions:
>>
>> 1) Why is the session closed, the client closed it or the cluster expired it?
>>
>> 2) which server was the session attached to - the first (44sec max
>> lat) or one of the others? Which server was the leader?
>>
>> 3) the znode exists on all 4 servers, is that right?
>>
>> 4) it's clear that a znode is being created for a session that's just
>> been closed. This is certainly a bug. Perhaps a timing issue due to
>> the huge latency hit that you reported.
>>
>> 5) why are your max latencies, as well as avg latency, so high?
>> a) are these dedicated boxes, not virtualized, correct?
>> b) is the jvm going into gc pause? (try turning on verbose logging, or
>> use "jstat" with the gc options to see the history if you still have
>> those jvms running)
>> c) swapping?
>> d) do you have dedicated spindles for the ZK WAL? If not another
>> process might be causing the fsyncs to pause. (you can use iostat or
>> strace to monitor this)
>>
>> My guess is b given it's the time btw the commit processor and
>> finalreq processor that's so long. Is that the log from the server
>> that's got the 44sec max latency?
>>
>> Patrick
>>
>> On Thu, Nov 10, 2011 at 11:15 AM, Neha Narkhede <neha.narkhede@gmail.com> wrote:
>>> Hi,
>>>
>>> We are seeing this issue again with our Kafka zookeeper setup. Let me
>>> explain with the following log entries -
>>>
>>> Here is the node that didn't get deleted -
>>>
>>> get /kafka-tracking/brokers/ids/528
>>> 172.18.98.101-1320781574264:172.18.98.101:10251
>>> cZxid = 0x605f90a78ctime = Tue Nov 08 11:46:26 PST 2011
>>> mZxid = 0x605f90a78mtime = Tue Nov 08 11:46:26 PST 2011
>>> pZxid = 0x605f90a78cversion = 0
>>> dataVersion = 0
>>> aclVersion = 0
>>> ephemeralOwner = 0x23220b93b4a33bc
>>> dataLength = 47
>>> numChildren = 0
>>>
>>> I ran Kishore's awesome version of the LogFormatter tool on the
>>> relevant portion of the zookeeper transaction logs -
>>>
>>> time:11/8/11 11:46:14 AM PST session:0x23220b93b4a33bc cxid:0x0
>>> zxid:0x605f90a59 type:createSession timeOut:6000
>>> time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x0
>>> zxid:0x605f90a72 type:closeSession
>>> time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x1
>>> zxid:0x605f90a78 type:create acl:[31,s{'world,'anyone}
>>>
>>> We want to understand why there was a 12 second delay in creation of
>>> znode after createSession
>>>
>>> Checking the logs, we found that the there is a 30+ sec delay between
>>> the commitProcessor and FinalRequestProcessor.
>>>
>>> 2011-11-08 11:46:14,661 - DEBUG
>>> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing
>>> request:: sessionid:0x23220b93b4a33bc type:createSession
>>>  cxid:0x0 zxid:0x605f90a59 txntype:-10 reqpath:n/a
>>> 2011-11-08 11:46:26,045 - DEBUG
>>> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing
>>> request:: sessionid:0x23220b93b4a33bc type:closeSession
>>> cxid:0x0 zxid:0x605f90a72 txntype:-11 reqpath:n/a
>>> 2011-11-08 11:46:26,047 - DEBUG
>>> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing
>>> request:: sessionid:0x23220b93b4a33bc type:create cxid:0
>>> x1 zxid:0x605f90a78 txntype:1 reqpath:n/a
>>> 2011-11-08 11:46:59,539 - DEBUG
>>> [CommitProcessor:1:FinalRequestProcessor@78] - Processing request::
>>> sessionid:0x23220b93b4a33bc type:createSession cxid:0x0 z
>>> xid:0x605f90a59 txntype:-10 reqpath:n/a
>>> 2011-11-08 11:46:59,587 - DEBUG
>>> [CommitProcessor:1:FinalRequestProcessor@78] - Processing request::
>>> sessionid:0x23220b93b4a33bc type:closeSession cxid:0x0 zx
>>> id:0x605f90a72 txntype:-11 reqpath:n/a
>>> 2011-11-08 11:46:59,589 - DEBUG
>>> [CommitProcessor:1:FinalRequestProcessor@78] - Processing request::
>>> sessionid:0x23220b93b4a33bc type:create cxid:0x1 zxid:0x6
>>> 05f90a78 txntype:1 reqpath:n/a
>>>
>>> Here is the stat output from our zookeeper servers -
>>>
>>> Latency min/avg/max: 0/53/44712
>>> Latency min/avg/max: 0/11/11727
>>> Latency min/avg/max: 0/12/11994
>>> Latency min/avg/max: 0/9/11707
>>>
>>> The relevant bug is here - https://issues.apache.org/jira/browse/ZOOKEEPER-1208
>>>
>>> This is a pretty serious issue since the fix is to have someone
>>> manually delete the ephemeral node. And until that is done, it blocks
>>> the startup of either the consumer client or the Kafka servers. Any
>>> help here in debugging this issue is appreciated !
>>>
>>> Thanks,
>>> Neha
>>>
>>
>

Mime
View raw message