zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick Hunt <ph...@apache.org>
Subject Re: ephemeral node not deleted after client session closed
Date Fri, 11 Nov 2011 00:43:25 GMT
See my update to 1208 for a test that demonstrates this.

On Thu, Nov 10, 2011 at 3:31 PM, Neha Narkhede <neha.narkhede@gmail.com> wrote:
> Thanks Patrick for looking into this issue !
>
>>> The logs would indicate if an election happens. Look for "LOOKING" or
> "LEADING" or "FOLLOWING".
>
> The logs don't have any such entries. So I'm guessing there was no election
> happening.
>
> Do you have thoughts, though, on how easy it would be to reproduce this
> bug, to verify the bug fix ?
>
> Thanks,
> Neha
>
>
> On Thu, Nov 10, 2011 at 2:08 PM, Patrick Hunt <phunt1@gmail.com> wrote:
>
>> On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <neha.narkhede@gmail.com>
>> wrote:
>> > Thanks for the quick responses, guys! Please find my replies inline -
>> >
>> >>> 1) Why is the session closed, the client closed it or the cluster
>> > expired it?
>> > Cluster expired it.
>> >
>>
>> Yes, I realized after that the cxid is 0 in your logs - that indicates
>> it was expired and not closed explicitly by the client.
>>
>>
>> >>> 3) the znode exists on all 4 servers, is that right?
>> > Yes
>> >
>>
>> This holds up my theory that the PrepRequestProcessor is accepting a
>> create from the client after the session has been expired.
>>
>> >>> 5) why are your max latencies, as well as avg latency, so high?
>> >>> a) are these dedicated boxes, not virtualized, correct?
>> > these are dedicated boxes, but zk is currently co-located with kafka, but
>> > on different disks
>> >
>> >>> b) is the jvm going into gc pause? (try turning on verbose logging,
or
>> > use "jstat" with the gc options to see the history if you still have
>> > those jvms running)
>> > I don't believe we had gc logs on these machines. So its unclear.
>> >
>> >>> d) do you have dedicated spindles for the ZK WAL? If not another
>> > process might be causing the fsyncs to pause. (you can use iostat or
>> > strace to monitor this)
>> > No. The log4j and zk txn logs share the same disks.
>> >
>> >>> Is that the log from the server that's got the 44sec max latency?
>> > Yes.
>> >
>> >>> This is 3.3.3 ?
>> > Yes.
>> >
>> >>> was there any instability in the quorum itself during this time
>> > period?
>> > How do I find that out ?
>>
>> The logs would indicate if an election happens. Look for "LOOKING" or
>> "LEADING" or "FOLLOWING".
>>
>>
>> Your comments are consistent with my theory. Seems like a bug in PRP
>> session validation to me.
>>
>> Patrick
>>
>

Mime
View raw message