zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From kishore g <g.kish...@gmail.com>
Subject Re: ephemeral node not deleted after client session closed
Date Fri, 11 Nov 2011 18:15:46 GMT
Hi Pat,

We are already working on that, it was because of deploying zk along with
other applications. We will have separate boxes for ZK very soon.

I already saw the stats and could not co relate with spikes, unfortunately
we dint have gc logs. There were few deployment rules that were clearly
violated and we will fix them.

Good news is that we found an issue :-).Thanks again for your help.

thanks,
Kishore G

On Fri, Nov 11, 2011 at 9:47 AM, Patrick Hunt <phunt@apache.org> wrote:

> We always triage new issues as they come in (same when 1208 originally
> came in). However our ability to determine the cause is often bounded
> by the information provided by the user, which in this recent update
> was excellent and pointed out exactly the flaw. Kudos.
>
> We'll button up this issue (1208) for 3.3/3.4/trunk. After which I'll
> work on cutting a new 3.3.4 that includes this and some other fixes.
> It would be good if you could test this patch in the mean time.
>
> Also, this is particularly worrisome for me, you reported the
> following for your servers:
>
> Latency min/avg/max: 0/53/44712
> Latency min/avg/max: 0/11/11727
> Latency min/avg/max: 0/12/11994
> Latency min/avg/max: 0/9/11707
>
> That's really really terrible performance and you won't be happy with
> ZK as a result. You need to followup with your ops team to determine
> why the performance you are seeing is so terrible. Both in terms of
> average and max latency.
>
> Spikes in max latency is usually due to GC, swap, or bad disk
> performance for the WAL.
> Bad average latency might indicate poor network performance, or again
> bad disk performance.
>
> Try turning on CMS/parallelGC. Also try using iostat and look at the
> await times you're seeing for the WAL disk (correlate that with spikes
> in max latency, those counters can be reset using a 4letterword).
>
> Regards,
>
> Patrick
>
> On Fri, Nov 11, 2011 at 7:21 AM, Neha Narkhede <neha.narkhede@gmail.com>
> wrote:
> > Pat,
> >
> > That is excellent turnaround ! I will take a look at the running the test
> > as well as your patch. Will be a good opportunity for me to start
> > understanding the zookeeper codebase.
> >
> > Thanks again,
> > Neha
> >
> > On Thursday, November 10, 2011, Patrick Hunt <phunt@apache.org> wrote:
> >> Ok, patch posted that fixes this (1208). Committers please take a look.
> >>
> >> Neha you might want to give a patched version a try. Awesome job
> >> helping to document and track down this issue. Thanks!
> >>
> >> Patrick
> >>
> >> On Thu, Nov 10, 2011 at 4:43 PM, Patrick Hunt <phunt@apache.org> wrote:
> >>> See my update to 1208 for a test that demonstrates this.
> >>>
> >>> On Thu, Nov 10, 2011 at 3:31 PM, Neha Narkhede <
> neha.narkhede@gmail.com>
> > wrote:
> >>>> Thanks Patrick for looking into this issue !
> >>>>
> >>>>>> The logs would indicate if an election happens. Look for "LOOKING"
> or
> >>>> "LEADING" or "FOLLOWING".
> >>>>
> >>>> The logs don't have any such entries. So I'm guessing there was no
> > election
> >>>> happening.
> >>>>
> >>>> Do you have thoughts, though, on how easy it would be to reproduce
> this
> >>>> bug, to verify the bug fix ?
> >>>>
> >>>> Thanks,
> >>>> Neha
> >>>>
> >>>>
> >>>> On Thu, Nov 10, 2011 at 2:08 PM, Patrick Hunt <phunt1@gmail.com>
> wrote:
> >>>>
> >>>>> On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <
> neha.narkhede@gmail.com
> >>
> >>>>> wrote:
> >>>>> > Thanks for the quick responses, guys! Please find my replies
> inline -
> >>>>> >
> >>>>> >>> 1) Why is the session closed, the client closed it
or the cluster
> >>>>> > expired it?
> >>>>> > Cluster expired it.
> >>>>> >
> >>>>>
> >>>>> Yes, I realized after that the cxid is 0 in your logs - that
> indicates
> >>>>> it was expired and not closed explicitly by the client.
> >>>>>
> >>>>>
> >>>>> >>> 3) the znode exists on all 4 servers, is that right?
> >>>>> > Yes
> >>>>> >
> >>>>>
> >>>>> This holds up my theory that the PrepRequestProcessor is accepting
a
> >>>>> create from the client after the session has been expired.
> >>>>>
> >>>>> >>> 5) why are your max latencies, as well as avg latency,
so high?
> >>>>> >>> a) are these dedicated boxes, not virtualized, correct?
> >>>>> > these are dedicated boxes, but zk is currently co-located with
> > kafka, but
> >>>>> > on different disks
> >>>>> >
> >>>>> >>> b) is the jvm going into gc pause? (try turning on
verbose
> > logging, or
> >>>>> > use "jstat" with the gc options to see the history if you still
> have
> >>>>> > those jvms running)
> >>>>> > I don't believe we had gc logs on these machines. So its unclear.
> >>>>> >
> >>>>> >>> d) do you have dedicated spindles for the ZK WAL? If
not another
> >>>>> > process might be causing the fsyncs to pause. (you can use
iostat
> or
> >>>>> > strace to monitor this)
> >>>>> > No. The log4j and zk txn logs share the same disks.
> >>>>> >
> >>>>> >>> Is that the log from the server that's got the 44sec
max latency?
> >>>>> > Yes.
> >>>>> >
> >>>>> >>> This is 3.3.3 ?
> >>>>> > Yes.
> >>>>> >
> >>>>> >>> was there any instability in the quorum itself during
this time
> >>>>> > period?
> >>>>> > How do I find that out ?
> >>>>>
> >>>>> The logs would indicate if an election happens. Look for "LOOKING"
or
> >>>>> "LEADING" or "FOLLOWING".
> >>>>>
> >>>>>
> >>>>> Your comments are consistent with my theory. Seems like a bug in
PRP
> >>>>> session validation to me.
> >>>>>
> >>>>> Patrick
> >>>>>
> >>>>
> >>>
> >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message