zookeeper-user mailing list archives

From Manosiz Bhattacharyya <manos...@gmail.com>
Subject Re: Timeouts and ping handling
Date Thu, 19 Jan 2012 00:47:52 GMT
Thanks Patrick for your answer,

Actually we are in a virtualized environment, and we have a FIO disk for
transactional logs. It does sometimes show latency during FIO garbage
collection. We know this could be the underlying issue, but we were trying
to work around it.

We were trying to classify the requests into two types: either HBs or
normal requests. Isn't it better to reject normal requests once the queue
size reaches a certain threshold, but keep the session alive? That
way flow control can be achieved by having the user's session retry the
operation, while the session health is maintained.
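
To make the idea concrete, here is a minimal sketch in Python of the
admission rule described above. It is only an illustration under stated
assumptions: `Request`, `AdmissionQueue`, `max_pending`, and the return
strings are hypothetical names, not ZooKeeper server code or API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    session_id: int
    is_heartbeat: bool
    payload: str = ""


class AdmissionQueue:
    """Toy request queue (hypothetical, not ZooKeeper code).

    Heartbeats are answered immediately. Normal requests are rejected
    once the backlog reaches max_pending. Either way, the session's
    last-contact time is refreshed.
    """

    def __init__(self, max_pending: int) -> None:
        self.max_pending = max_pending
        self.pending = deque()
        self.last_seen = {}  # session_id -> time of last contact

    def submit(self, req: Request, now: float) -> str:
        # Any contact from the client counts as proof of life.
        self.last_seen[req.session_id] = now
        if req.is_heartbeat:
            # Heartbeats never wait behind queued writes.
            return "PONG"
        if len(self.pending) >= self.max_pending:
            # Flow control: the client is expected to retry later.
            return "REJECTED_RETRY"
        self.pending.append(req)
        return "QUEUED"
```

With `max_pending=2`, a third normal request gets `REJECTED_RETRY`, while
a heartbeat arriving at the same moment still gets `PONG`, and both
sessions stay alive.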

Regards,
Manosiz.

On Wed, Jan 18, 2012 at 2:53 PM, Patrick Hunt <phunt@apache.org> wrote:

> Next up is disk. (I'm assuming you're not running in a virtualized
> environment, correct?) You have a dedicated log device for the
> transactional logs? Check your disk latency and make sure that's not
> holding up the writes.
>
> What does "stat" show you wrt latency in general and at the time you
> see the issue on the client?
>
> You've looked through the troubleshooting guide?
> http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting
>
> Patrick
>
> On Wed, Jan 18, 2012 at 2:47 PM, Manosiz Bhattacharyya
> <manosizb@gmail.com> wrote:
> > Thanks a lot for your response. We are running the c-client, as all our
> > components are C++ applications. We are tracing GC on the server side, but
> > did not see much activity there. We did tune GC. Our gc flags include the
> > following:
> >
> > JVMFLAGS="$JVMFLAGS -XX:+UseParNewGC"
> > JVMFLAGS="$JVMFLAGS -XX:+UseConcMarkSweepGC"
> > JVMFLAGS="$JVMFLAGS -XX:+CMSParallelRemarkEnabled"
> > JVMFLAGS="$JVMFLAGS -XX:SurvivorRatio=8"
> > JVMFLAGS="$JVMFLAGS -XX:MaxTenuringThreshold=1"
> > JVMFLAGS="$JVMFLAGS -XX:CMSInitiatingOccupancyFraction=75"
> > JVMFLAGS="$JVMFLAGS -XX:+UseCMSInitiatingOccupancyOnly"
> > JVMFLAGS="$JVMFLAGS -XX:ParallelCMSThreads=1"
> >
> > The JMX console shows that the old gen is not getting full at all - the new
> > gen is pretty much where the activity is, and the pauses in the verbose:gc
> > output are only about 10-20 ms.
> >
> > On Wed, Jan 18, 2012 at 2:34 PM, Patrick Hunt <phunt@apache.org> wrote:
> >
> >> 5 seconds is fairly low. HBs are sent by the client every 1/3 the
> >> timeout, with the expectation that it will get a response in another 1/3
> >> the timeout. If not, the client session will time out.
> >>
> >> As a result, any blip of 1.5 sec or more between the client and server
> >> could cause this to happen. Network latency, OS latency, ZK server
> >> latency, client latency, etc.
> >>
> >> I suspect that you are being affected by GC pauses. Have you tuned the
> >> GC at all or just the defaults? Monitor the GC in the VM during
> >> operation and see if this is affecting you. At the very least you need
> >> to turn on parallel/CMS/incremental GC.
> >>
> >> Patrick
> >>
> >> On Wed, Jan 18, 2012 at 1:26 PM, Manosiz Bhattacharyya
> >> <manosizb@gmail.com> wrote:
> >> > Hello,
> >> >
> >> >  We are using Zookeeper-3.3.4 with client session timeouts of 5 seconds,
> >> > and we see frequent timeouts. We have a cluster of 50 nodes (3 of which are
> >> > ZK nodes) and each node has 5 client connections (a total of 250 connections
> >> > to the ensemble). While investigating the zookeeper connections, we found
> >> > that sometimes pings sent from the zookeeper client do not return from
> >> > the server within 5 seconds, and the client connection gets disconnected.
> >> > Digging deeper, it seems that pings are enqueued the same way as other
> >> > requests in the three-stage request processing pipeline (prep, sync,
> >> > finalize) in zkserver. So if there are a lot of write operations from other
> >> > active sessions in front of a ping from an inactive session in the queues,
> >> > the inactive session could time out.
> >> >
> >> > My question is whether the server can answer a client's ping request
> >> > immediately, as the purpose of the ping request seems to be
> >> > to treat it as a heartbeat from relatively inactive sessions. If we keep a
> >> > separate ping queue in the prep phase which forwards it straight to the
> >> > finalize phase, requests ahead of the ping that require I/O inside
> >> > the sync phase would not cause client timeouts. I hope pings do not
> >> > generate any order in the database. I did take a cursory look at the code
> >> > and thought that could be done. Would really appreciate an opinion
> >> > regarding this.
> >> >
> >> > As an aside, I should mention that increasing the session timeout to 20
> >> > seconds did improve the problem significantly. However, as we are using
> >> > Zookeeper to monitor the health of our components, increasing the timeout
> >> > means that we only get to know of a component's death 20 seconds later.
> >> > This is something we would definitely try to avoid, and we would like to
> >> > go back to the 5 second timeout.
> >> >
> >> > Regards,
> >> > Manosiz.
> >>
>

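For reference, the heartbeat timing described in the quoted thread can be
worked through numerically. This is a small sketch in Python;
`heartbeat_budget` is a hypothetical helper, not part of any ZooKeeper
client, and it assumes only the rule stated in the thread: a ping is sent
every 1/3 of the session timeout, and a reply is expected within another
1/3.

```python
def heartbeat_budget(session_timeout_s: float) -> dict:
    """Split a session timeout into the intervals described in the thread.

    Assumption (from the quoted reply, not from ZooKeeper source code):
    the client pings every timeout/3 and expects a reply within another
    timeout/3.
    """
    third = session_timeout_s / 3.0
    return {
        "ping_interval_s": third,    # idle time before a ping is sent
        "response_window_s": third,  # time allowed for the reply
    }


for timeout in (5.0, 20.0):
    budget = heartbeat_budget(timeout)
    print(f"timeout={timeout:.0f}s "
          f"ping every {budget['ping_interval_s']:.2f}s, "
          f"reply within {budget['response_window_s']:.2f}s")
# prints:
# timeout=5s ping every 1.67s, reply within 1.67s
# timeout=20s ping every 6.67s, reply within 6.67s
```

Under that rule, a 5 second timeout leaves well under two seconds for a
ping to clear the server's queues, which is in the same range as the
1.5 sec blip mentioned in the thread. A 20 second timeout leaves several
seconds of slack.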