zookeeper-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Timeouts and ping handling
Date Thu, 19 Jan 2012 18:18:03 GMT
ZK does pretty much entirely sequential I/O.

One thing it does which might be very, very bad for an SSD is that it
pre-allocates disk extents in the log by writing a bunch of zeros. This is
to avoid directory updates as the log is written, but it doubles the write
load on the SSD.
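
To make the pattern concrete, here is a minimal Java sketch of zero-fill
preallocation - not ZooKeeper's actual code, and the extent size is just
illustrative (the real knob is the zookeeper.preAllocSize system property):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch: pad the transaction log in large zeroed extents so that
    // subsequent appends/fsyncs don't force file-size metadata updates.
    public class PreallocSketch {
        static final long EXTENT = 64L * 1024 * 1024; // illustrative 64 MB

        static void pad(RandomAccessFile log) throws IOException {
            long pos = log.getFilePointer();
            byte[] zeros = new byte[64 * 1024];
            // Each block of zeros is a real write to the device, so every
            // log byte is effectively written twice - the doubled SSD load.
            for (long p = pos; p < pos + EXTENT; p += zeros.length) {
                log.write(zeros);
            }
            log.seek(pos); // resume appending transactions over the padding
        }
    }

(The file would be opened with new RandomAccessFile(file, "rw"); real code
would also handle partial extents and sync policy.)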

On Thu, Jan 19, 2012 at 5:31 PM, Manosiz Bhattacharyya
<manosizb@gmail.com> wrote:

> I do not think that there is a problem with the queue size. I guess the
> problem is more with latency when the Fusion I/O goes into a GC. We are
> enabling stats on the Zookeeper and the Fusion I/O to be more precise.
> Does Zookeeper typically do only sequential I/O, or does it do some
> random too? We could then move the logs to a disk.
>
> Thanks,
> Manosiz.
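
Moving the transaction log to its own device is just a zoo.cfg change; a
minimal example, with made-up paths:

    # zoo.cfg - keep snapshots and the transaction log on separate devices
    dataDir=/var/lib/zookeeper/data
    dataLogDir=/mnt/spindle/zookeeper-log

dataLogDir is the standard setting for giving the log a dedicated disk.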
>
> On Wed, Jan 18, 2012 at 10:18 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > If you aren't pushing much data through ZK, there is almost no way that
> > the request queue can fill up without the log or snapshot disks being
> > slow. See what happens if you put the log onto a real disk or (heaven
> > help us) onto a tmpfs partition.
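
For the tmpfs experiment suggested here, something like this would do
(mount point and size are arbitrary; tmpfs loses its contents on reboot,
so this is strictly a diagnostic):

    sudo mkdir -p /mnt/zk-tmpfs
    sudo mount -t tmpfs -o size=512m tmpfs /mnt/zk-tmpfs
    # then point dataLogDir at /mnt/zk-tmpfs in zoo.cfg and restart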
> >
> > On Thu, Jan 19, 2012 at 2:18 AM, Manosiz Bhattacharyya
> > <manosizb@gmail.com> wrote:
> >
> > > I will do as you mention.
> > >
> > > We are using the async APIs throughout. Also, we do not write too
> > > much data into Zookeeper. We just use it for leadership elections and
> > > health monitoring, which is why we see the timeouts typically on idle
> > > zookeeper connections.
> > >
> > > The reason we want the sessions to stay alive is the leadership
> > > election algorithm that we use from the zookeeper recipe. If the
> > > connection is broken for the leader node, the ephemeral node that
> > > guaranteed its leadership is lost, and reconnecting will create a new
> > > node which does not guarantee leadership. We then have to re-elect a
> > > new leader - which requires significant work. The bigger the timeout,
> > > the longer the cluster stays without a master for a particular
> > > service, as the old master cannot keep working once it knows its
> > > session is gone and, with it, its ephemeral node. As we are trying to
> > > build a highly available service (not internet scale, but at the
> > > scale of a storage system with ms latencies typically), we thought
> > > about reducing the timeout but keeping the session open. Also note
> > > that the node that is typically the master does not write too often
> > > into zookeeper.
> > >
> > > Thanks,
> > > Manosiz.
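
The dependence on the ephemeral node is easy to see in a stripped-down
version of the election pattern described above. A minimal Java sketch,
assuming a single /election/leader znode (the real recipe uses sequential
ephemerals plus watches on the predecessor, with far more error handling):

    import org.apache.zookeeper.*;

    // Sketch of ephemeral-node leadership. When the session expires,
    // ZooKeeper deletes /election/leader, so leadership is lost exactly
    // as described in the message above.
    public class LeaderSketch implements Watcher {
        private ZooKeeper zk;

        void connect() throws Exception {
            // 15s session timeout is illustrative, not a recommendation.
            zk = new ZooKeeper("localhost:2181", 15000, this);
        }

        boolean tryBecomeLeader() throws Exception {
            try {
                // EPHEMERAL: the znode lives exactly as long as our session.
                zk.create("/election/leader", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return true;  // we hold leadership
            } catch (KeeperException.NodeExistsException e) {
                return false; // someone else is the leader
            }
        }

        public void process(WatchedEvent event) {
            if (event.getState() == Event.KeeperState.Expired) {
                // Session gone => ephemeral node gone => must stop acting
                // as master and run a fresh election after reconnecting.
            }
        }
    }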
> > >
> > > On Wed, Jan 18, 2012 at 5:49 PM, Patrick Hunt <phunt@apache.org>
> > > wrote:
> > >
> > > > On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya
> > > > <manosizb@gmail.com> wrote:
> > > > > Thanks Patrick for your answer,
> > > >
> > > > No problem.
> > > >
> > > > > Actually we are in a virtualized environment; we have a FIO disk
> > > > > for transactional logs. It does have some latency sometimes during
> > > > > FIO garbage collection. We know this could be the potential issue,
> > > > > but we were trying to work around that.
> > > >
> > > > Ah, I see. I saw something very similar to this recently with SSDs
> > > > used for the datadir. The fdatasync latency was sometimes > 10
> > > > seconds. I suspect it happened as a result of disk GC activity.
> > > >
> > > > I was able to identify the problem by running something like this:
> > > >
> > > > sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o trace.txt
> > > >
> > > > and then graphing the results (log scale). You should try running
> > > > this against your servers to confirm that it is indeed the problem.
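
With -T, strace appends the time spent in each call in angle brackets at
the end of the line, so the durations can be pulled out with standard
tools before graphing, e.g. (assuming no other angle brackets on the line):

    # ten slowest fdatasync calls, in seconds
    grep fdatasync trace.txt | awk -F'[<>]' '{print $2}' | sort -rn | head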
> > > >
> > > > > We were trying to qualify the requests into two types - either
> > > > > HBs or normal requests. Isn't it better to reject normal requests
> > > > > if the queue fills up to, say, a certain threshold, but keep the
> > > > > session alive? That way flow control can be achieved by the user's
> > > > > session retrying the operation, while the session health is
> > > > > maintained.
> > > >
> > > > What good is a session (connection) that's not usable? You're better
> > > > off disconnecting and re-establishing with a server that can process
> > > > your requests in a timely fashion.
> > > >
> > > > ZK looks at availability from a service perspective, not from an
> > > > individual session/connection perspective. The whole is more
> > > > important than the parts. There is already very sophisticated flow
> > > > control going on - e.g. the server stops reading requests from
> > > > sessions when the number of outstanding requests on a server exceeds
> > > > some threshold. Once the server catches up it starts reading again.
> > > > Again - check out your "stat" results for insight into this
> > > > (i.e. "outstanding requests").
> > > >
> > > > Patrick
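
"stat" here is one of ZooKeeper's four-letter-word commands, run against
a server's client port; its output includes the outstanding-request count:

    echo stat | nc localhost 2181   # look for the "Outstanding: N" line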
> > > >
> > >
> >
>
