zookeeper-user mailing list archives

From Manosiz Bhattacharyya <manos...@gmail.com>
Subject Re: Timeouts and ping handling
Date Thu, 19 Jan 2012 19:47:08 GMT
Thanks,
Manosiz.

On Thu, Jan 19, 2012 at 11:31 AM, Patrick Hunt <phunt@apache.org> wrote:

> See "preAllocSize"
>
> http://zookeeper.apache.org/doc/r3.4.2/zookeeperAdmin.html#sc_advancedConfiguration
>
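(As an aside: preAllocSize is exposed as a Java system property, so a typical way to shrink the pre-allocation step is to set JVMFLAGS before restarting the server; the 1 MB value and the zkServer.sh invocation below are illustrative only:)

    # zookeeper.preAllocSize is specified in kilobytes (the default step is ~64 MB)
    export JVMFLAGS="-Dzookeeper.preAllocSize=1024"
    bin/zkServer.sh restart
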
> On Thu, Jan 19, 2012 at 10:49 AM, Manosiz Bhattacharyya
> <manosizb@gmail.com> wrote:
> > Thanks a lot for this info. A pointer in the code to where you do this
> > preallocation or a flag to disable this would be very beneficial.
> >
> > On Thu, Jan 19, 2012 at 10:18 AM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> >> ZK does pretty much entirely sequential I/O.
> >>
> >> One thing that it does which might be very, very bad for SSD is that it
> >> pre-allocates disk extents in the log by writing a bunch of zeros. This is
> >> to avoid directory updates as the log is written, but it doubles the load
> >> on the SSD.
> >>
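(A rough sketch of the padding technique described above, not ZooKeeper's actual code; the 64 MB step size and the 4 KB slack check are assumed values:)

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Grow the transaction log in large zero-filled steps so the file length
    // (and directory metadata) rarely changes between appends. The cost is that
    // each region of the log gets written twice: once as zero padding and once
    // as real data, hence the extra write load on an SSD.
    class LogPaddingSketch {
        private static final int PREALLOC_BYTES = 64 * 1024 * 1024; // assumed step size

        static void padIfNeeded(FileChannel log) throws IOException {
            if (log.position() + 4096 >= log.size()) {        // near the end of the extent?
                long end = log.size();
                ByteBuffer zeros = ByteBuffer.allocate(PREALLOC_BYTES); // all zeroes
                while (zeros.hasRemaining()) {
                    log.write(zeros, end + zeros.position()); // extend the file with zeros
                }
            }
        }
    }
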
> >> On Thu, Jan 19, 2012 at 5:31 PM, Manosiz Bhattacharyya
> >> <manosizb@gmail.com> wrote:
> >>
> >> > I do not think that there is a problem with the queue size. I guess the
> >> > problem is more with latency when the Fusion I/O goes in for a GC. We are
> >> > enabling stats on the Zookeeper and the Fusion I/O to be more precise. Does
> >> > Zookeeper typically do only sequential I/O, or does it do some random too?
> >> > We could then move the logs to a disk.
> >> >
> >> > Thanks,
> >> > Manosiz.
> >> >
> >> > On Wed, Jan 18, 2012 at 10:18 PM, Ted Dunning <ted.dunning@gmail.com>
> >> > wrote:
> >> >
> >> > > If you aren't pushing much data through ZK, there is almost no way that
> >> > > the request queue can fill up without the log or snapshot disks being
> >> > > slow. See what happens if you put the log into a real disk or (heaven
> >> > > help us) onto a tmpfs partition.
> >> > >
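(Splitting the transaction log off from the snapshots is done with dataLogDir in zoo.cfg; the paths below are illustrative:)

    # zoo.cfg: keep snapshots on one device, transaction logs on another
    dataDir=/ssd/zookeeper/data
    dataLogDir=/disk1/zookeeper/txnlog
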
> >> > > On Thu, Jan 19, 2012 at 2:18 AM, Manosiz Bhattacharyya
> >> > > <manosizb@gmail.com> wrote:
> >> > >
> >> > > > I will do as you mention.
> >> > > >
> >> > > > We are using the async APIs throughout. Also we do not write too much
> >> > > > data into Zookeeper. We just use it for leadership elections and health
> >> > > > monitoring, which is why we see the timeouts typically on idle zookeeper
> >> > > > connections.
> >> > > >
> >> > > > The reason why we want the sessions to be alive is because of the
> >> > > > leadership election algorithm that we use from the zookeeper recipe. So
> >> > > > if a connection is broken for the leader node, the ephemeral node that
> >> > > > guaranteed its leadership is lost, and reconnecting will create a new
> >> > > > node which does not guarantee leadership. We then have to re-elect a new
> >> > > > leader - which requires significant work. The bigger the timeout, the
> >> > > > longer the cluster stays without a master for a particular service, as
> >> > > > the old master cannot keep working once it knows its session is gone
> >> > > > and, with it, its ephemeral node. As we are trying to have a highly
> >> > > > available service (not internet scale, but at the scale of a storage
> >> > > > system with ms latencies typically), we thought about reducing the
> >> > > > timeout but keeping the session open. Also note that the node that
> >> > > > typically is the master does not write too often into zookeeper.
> >> > > >
> >> > > > Thanks,
> >> > > > Manosiz.
> >> > > >
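(The recipe referred to above essentially amounts to holding an ephemeral znode; a minimal sketch against the standard ZooKeeper Java client, with the znode path and payload as illustrative placeholders:)

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Whoever creates the ephemeral znode is the leader. The znode vanishes when
    // the creating session expires, which is why a lost session forces a
    // re-election even if the process itself is still healthy.
    class LeaderClaimSketch {
        static boolean tryClaim(ZooKeeper zk) throws KeeperException, InterruptedException {
            try {
                zk.create("/myapp/leader",                    // illustrative path
                          "host-1".getBytes(),                // illustrative payload
                          ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL);
                return true;                                  // we now hold leadership
            } catch (KeeperException.NodeExistsException e) {
                return false;                                 // someone else is leader
            }
        }
    }
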
> >> > > > On Wed, Jan 18, 2012 at 5:49 PM, Patrick Hunt <phunt@apache.org>
> >> > > > wrote:
> >> > > >
> >> > > > > On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya
> >> > > > > <manosizb@gmail.com> wrote:
> >> > > > > > Thanks Patrick for your answer,
> >> > > > >
> >> > > > > No problem.
> >> > > > >
> >> > > > > > Actually we are in a virtualized environment; we have a FIO disk
> >> > > > > > for transactional logs. It does have some latency sometimes during
> >> > > > > > FIO garbage collection. We know this could be the potential issue,
> >> > > > > > but were trying to work around that.
> >> > > > >
> >> > > > > Ah, I see. I saw something very similar to this recently with SSDs
> >> > > > > used for the datadir. The fdatasync latency was sometimes > 10
> >> > > > > seconds. I suspect it happened as a result of disk GC activity.
> >> > > > >
> >> > > > > I was able to identify the problem by running something like this:
> >> > > > >
> >> > > > > sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o trace.txt
> >> > > > >
> >> > > > > and then graphing the results (log scale). You should try running
> >> > > > > this against your servers to confirm that it is indeed the problem.
> >> > > > >
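(Assuming the -T output format where the elapsed time is the trailing <...> field, something like this pulls the slowest fdatasync calls out of trace.txt before graphing:)

    grep fdatasync trace.txt | grep -oE '<[0-9.]+>$' | tr -d '<>' | sort -n | tail
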
> >> > > > > > We were trying to qualify the requests into two types - either HBs
> >> > > > > > or normal requests. Isn't it better to reject normal requests if
> >> > > > > > the queue size is full to, say, a certain threshold, but keep the
> >> > > > > > session alive? That way flow control can be achieved with the user's
> >> > > > > > session retrying the operation, but the session health would be
> >> > > > > > maintained.
> >> > > > >
> >> > > > > What good is a session (connection) that's not usable? You're better
> >> > > > > off disconnecting and re-establishing with a server that can process
> >> > > > > your requests in a timely fashion.
> >> > > > >
> >> > > > > ZK looks at availability from a service perspective, not from an
> >> > > > > individual session/connection perspective. The whole is more important
> >> > > > > than the parts. There already is very sophisticated flow control going
> >> > > > > on - e.g. the sessions shut down and stop reading requests when the
> >> > > > > number of outstanding requests on a server exceeds some threshold.
> >> > > > > Once the server catches up it starts reading again. Again - check out
> >> > > > > your "stat" results for insight into this (i.e. "outstanding requests").
> >> > > > >
> >> > > > > Patrick
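(The "stat" output referred to above comes from ZooKeeper's four-letter-word interface; assuming the default client port, e.g.:)

    echo stat | nc localhost 2181
    # the "Outstanding:" line shows requests currently queued on that server
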
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>
