zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick Hunt <ph...@apache.org>
Subject Re: ephemeral node not deleted after client session closed
Date Thu, 10 Nov 2011 19:50:05 GMT
Great report, some addl questions:

1) Why is the session closed, the client closed it or the cluster expired it?

2) which server was the session attached to - the first (44sec max
lat) or one of the others? Which server was the leader?

3) the znode exists on all 4 servers, is that right?

4) it's clear that a znode is being created for a session that's just
been closed. This is certainly a bug. Perhaps a timing issue due to
the huge latency hit that you reported.

5) why are your max latencies, as well as avg latency, so high?
a) are these dedicated boxes, not virtualized, correct?
b) is the jvm going into gc pause? (try turning on verbose logging, or
use "jstat" with the gc options to see the history if you still have
those jvms running)
c) swapping?
d) do you have dedicated spindles for the ZK WAL? If not another
process might be causing the fsyncs to pause. (you can use iostat or
strace to monitor this)

My guess is b given it's the time btw the commit processor and
finalreq processor that's so long. Is that the log from the server
that's got the 44sec max latency?

Patrick

On Thu, Nov 10, 2011 at 11:15 AM, Neha Narkhede <neha.narkhede@gmail.com> wrote:
> Hi,
>
> We are seeing this issue again with our Kafka zookeeper setup. Let me
> explain with the following log entries -
>
> Here is the node that didn't get deleted -
>
> get /kafka-tracking/brokers/ids/528
> 172.18.98.101-1320781574264:172.18.98.101:10251
> cZxid = 0x605f90a78ctime = Tue Nov 08 11:46:26 PST 2011
> mZxid = 0x605f90a78mtime = Tue Nov 08 11:46:26 PST 2011
> pZxid = 0x605f90a78cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x23220b93b4a33bc
> dataLength = 47
> numChildren = 0
>
> I ran Kishore's awesome version of the LogFormatter tool on the
> relevant portion of the zookeeper transaction logs -
>
> time:11/8/11 11:46:14 AM PST session:0x23220b93b4a33bc cxid:0x0
> zxid:0x605f90a59 type:createSession timeOut:6000
> time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x0
> zxid:0x605f90a72 type:closeSession
> time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x1
> zxid:0x605f90a78 type:create acl:[31,s{'world,'anyone}
>
> We want to understand why there was a 12 second delay in creation of
> znode after createSession
>
> Checking the logs, we found that the there is a 30+ sec delay between
> the commitProcessor and FinalRequestProcessor.
>
> 2011-11-08 11:46:14,661 - DEBUG
> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing
> request:: sessionid:0x23220b93b4a33bc type:createSession
>  cxid:0x0 zxid:0x605f90a59 txntype:-10 reqpath:n/a
> 2011-11-08 11:46:26,045 - DEBUG
> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing
> request:: sessionid:0x23220b93b4a33bc type:closeSession
> cxid:0x0 zxid:0x605f90a72 txntype:-11 reqpath:n/a
> 2011-11-08 11:46:26,047 - DEBUG
> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing
> request:: sessionid:0x23220b93b4a33bc type:create cxid:0
> x1 zxid:0x605f90a78 txntype:1 reqpath:n/a
> 2011-11-08 11:46:59,539 - DEBUG
> [CommitProcessor:1:FinalRequestProcessor@78] - Processing request::
> sessionid:0x23220b93b4a33bc type:createSession cxid:0x0 z
> xid:0x605f90a59 txntype:-10 reqpath:n/a
> 2011-11-08 11:46:59,587 - DEBUG
> [CommitProcessor:1:FinalRequestProcessor@78] - Processing request::
> sessionid:0x23220b93b4a33bc type:closeSession cxid:0x0 zx
> id:0x605f90a72 txntype:-11 reqpath:n/a
> 2011-11-08 11:46:59,589 - DEBUG
> [CommitProcessor:1:FinalRequestProcessor@78] - Processing request::
> sessionid:0x23220b93b4a33bc type:create cxid:0x1 zxid:0x6
> 05f90a78 txntype:1 reqpath:n/a
>
> Here is the stat output from our zookeeper servers -
>
> Latency min/avg/max: 0/53/44712
> Latency min/avg/max: 0/11/11727
> Latency min/avg/max: 0/12/11994
> Latency min/avg/max: 0/9/11707
>
> The relevant bug is here - https://issues.apache.org/jira/browse/ZOOKEEPER-1208
>
> This is a pretty serious issue since the fix is to have someone
> manually delete the ephemeral node. And until that is done, it blocks
> the startup of either the consumer client or the Kafka servers. Any
> help here in debugging this issue is appreciated !
>
> Thanks,
> Neha
>

Mime
View raw message