zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick Hunt <ph...@apache.org>
Subject Re: ephemeral node not deleted after client session closed
Date Fri, 11 Nov 2011 01:47:33 GMT
Ok, patch posted that fixes this (1208). Committers please take a look.

Neha you might want to give a patched version a try. Awesome job
helping to document and track down this issue. Thanks!

Patrick

On Thu, Nov 10, 2011 at 4:43 PM, Patrick Hunt <phunt@apache.org> wrote:
> See my update to 1208 for a test that demonstrates this.
>
> On Thu, Nov 10, 2011 at 3:31 PM, Neha Narkhede <neha.narkhede@gmail.com> wrote:
>> Thanks Patrick for looking into this issue !
>>
>>>> The logs would indicate if an election happens. Look for "LOOKING" or
>> "LEADING" or "FOLLOWING".
>>
>> The logs don't have any such entries. So I'm guessing there was no election
>> happening.
>>
>> Do you have thoughts, though, on how easy it would be to reproduce this
>> bug, to verify the bug fix ?
>>
>> Thanks,
>> Neha
>>
>>
>> On Thu, Nov 10, 2011 at 2:08 PM, Patrick Hunt <phunt1@gmail.com> wrote:
>>
>>> On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <neha.narkhede@gmail.com>
>>> wrote:
>>> > Thanks for the quick responses, guys! Please find my replies inline -
>>> >
>>> >>> 1) Why is the session closed, the client closed it or the cluster
>>> > expired it?
>>> > Cluster expired it.
>>> >
>>>
>>> Yes, I realized after that the cxid is 0 in your logs - that indicates
>>> it was expired and not closed explicitly by the client.
>>>
>>>
>>> >>> 3) the znode exists on all 4 servers, is that right?
>>> > Yes
>>> >
>>>
>>> This holds up my theory that the PrepRequestProcessor is accepting a
>>> create from the client after the session has been expired.
>>>
>>> >>> 5) why are your max latencies, as well as avg latency, so high?
>>> >>> a) are these dedicated boxes, not virtualized, correct?
>>> > these are dedicated boxes, but zk is currently co-located with kafka, but
>>> > on different disks
>>> >
>>> >>> b) is the jvm going into gc pause? (try turning on verbose logging,
or
>>> > use "jstat" with the gc options to see the history if you still have
>>> > those jvms running)
>>> > I don't believe we had gc logs on these machines. So its unclear.
>>> >
>>> >>> d) do you have dedicated spindles for the ZK WAL? If not another
>>> > process might be causing the fsyncs to pause. (you can use iostat or
>>> > strace to monitor this)
>>> > No. The log4j and zk txn logs share the same disks.
>>> >
>>> >>> Is that the log from the server that's got the 44sec max latency?
>>> > Yes.
>>> >
>>> >>> This is 3.3.3 ?
>>> > Yes.
>>> >
>>> >>> was there any instability in the quorum itself during this time
>>> > period?
>>> > How do I find that out ?
>>>
>>> The logs would indicate if an election happens. Look for "LOOKING" or
>>> "LEADING" or "FOLLOWING".
>>>
>>>
>>> Your comments are consistent with my theory. Seems like a bug in PRP
>>> session validation to me.
>>>
>>> Patrick
>>>
>>
>

Mime
View raw message