zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: Timeouts and ping handling
Date Thu, 19 Jan 2012 19:31:46 GMT
See "preAllocSize"
http://zookeeper.apache.org/doc/r3.4.2/zookeeperAdmin.html#sc_advancedConfiguration
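For reference, preAllocSize is documented as a Java system property (zookeeper.preAllocSize, value in kilobytes) rather than a zoo.cfg key in 3.4.x, so it is typically passed via the JVM flags. A hedged example, assuming the stock bin/zkServer.sh startup script, which honors a JVMFLAGS environment variable:

```shell
# Shrink the transaction-log preallocation from the default 64MB to 1MB.
# zookeeper.preAllocSize is in kilobytes; a smaller value reduces the
# zero-fill write load on an SSD at the cost of more frequent extents.
export JVMFLAGS="-Dzookeeper.preAllocSize=1024"
bin/zkServer.sh start
```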

On Thu, Jan 19, 2012 at 10:49 AM, Manosiz Bhattacharyya
<manosizb@gmail.com> wrote:
> Thanks a lot for this info. A pointer in the code to where you do this
> preallocation or a flag to disable this would be very beneficial.
>
> On Thu, Jan 19, 2012 at 10:18 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>> ZK does pretty much entirely sequential I/O.
>>
>> One thing that it does which might be very, very bad for SSD is that it
>> pre-allocates disk extents in the log by writing a bunch of zeros.  This is
>> to avoid directory updates as the log is written, but it doubles the load
>> on the SSD.
>>
>> On Thu, Jan 19, 2012 at 5:31 PM, Manosiz Bhattacharyya
>> <manosizb@gmail.com>wrote:
>>
>> > I do not think that there is a problem with the queue size. I guess
>> > the problem is more about latency when the Fusion I/O goes in for a
>> > GC. We are enabling stats on Zookeeper and the Fusion I/O to be more
>> > precise. Does Zookeeper typically do only sequential I/O, or does it
>> > do some random I/O too? We could then move the logs to a disk.
>> >
>> > Thanks,
>> > Manosiz.
>> >
>> > On Wed, Jan 18, 2012 at 10:18 PM, Ted Dunning <ted.dunning@gmail.com>
>> > wrote:
>> >
>> > > If you aren't pushing much data through ZK, there is almost no way
>> > > that the request queue can fill up without the log or snapshot disks
>> > > being slow. See what happens if you put the log onto a real disk or
>> > > (heaven help us) onto a tmpfs partition.
>> > >
>> > > On Thu, Jan 19, 2012 at 2:18 AM, Manosiz Bhattacharyya
>> > > <manosizb@gmail.com>wrote:
>> > >
>> > > > I will do as you mention.
>> > > >
>> > > > We are using the async APIs throughout, and we do not write much
>> > > > data into Zookeeper. We use it only for leader election and health
>> > > > monitoring, which is why we typically see the timeouts on idle
>> > > > zookeeper connections.
>> > > >
>> > > > We want the sessions to stay alive because of the leader election
>> > > > algorithm we use from the zookeeper recipes. If the leader node's
>> > > > connection is broken, the ephemeral node that guaranteed its
>> > > > leadership is lost, and reconnecting creates a new node which does
>> > > > not guarantee leadership. We then have to elect a new leader,
>> > > > which requires significant work. The bigger the timeout, the
>> > > > longer the cluster stays without a master for a particular
>> > > > service, since the old master cannot keep working once it knows
>> > > > its session, and with it its ephemeral node, is gone. As we are
>> > > > aiming for a highly available service (not internet scale, but at
>> > > > the scale of a storage system with typical latencies in
>> > > > milliseconds), we thought about reducing the timeout while keeping
>> > > > the session open. Also note that the node that is typically the
>> > > > master does not write into zookeeper very often.
>> > > >
>> > > > Thanks,
>> > > > Manosiz.
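[For context: the standard zookeeper leader election recipe has each contender create an ephemeral sequential znode under an election path; the session owning the lowest sequence number is the leader, which is why losing the session (and its ephemeral node) forces a re-election. A minimal sketch of just the decision rule, in plain Python with illustrative names and no live ensemble assumed:]

```python
def elect_leader(children):
    """Given the child znode names under the election path, e.g.
    ['n_0000000003', 'n_0000000001'], return the leader: the node
    with the lowest sequence suffix, per the election recipe."""
    if not children:
        return None
    # Ephemeral sequential nodes end in a zero-padded sequence counter.
    return min(children, key=lambda name: int(name.rsplit("_", 1)[-1]))

def node_to_watch(children, me):
    """A contender that is not the leader watches the node just ahead
    of it, so only one client is notified when that node goes away."""
    ordered = sorted(children, key=lambda n: int(n.rsplit("_", 1)[-1]))
    idx = ordered.index(me)
    return None if idx == 0 else ordered[idx - 1]
```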
>> > > >
>> > > > On Wed, Jan 18, 2012 at 5:49 PM, Patrick Hunt <phunt@apache.org>
>> > wrote:
>> > > >
>> > > > > On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya
>> > > > > <manosizb@gmail.com> wrote:
>> > > > > > Thanks Patrick for your answer,
>> > > > >
>> > > > > No problem.
>> > > > >
>> > > > > > Actually we are in a virtualized environment; we have a FIO
>> > > > > > disk for the transaction logs. It does sometimes show latency
>> > > > > > during FIO garbage collection. We know this could be the
>> > > > > > potential issue, but we were trying to work around it.
>> > > > >
>> > > > > Ah, I see. I saw something very similar to this recently with
>> > > > > SSDs used for the datadir. The fdatasync latency was sometimes
>> > > > > > 10 seconds. I suspect it happened as a result of disk GC
>> > > > > activity.
>> > > > >
>> > > > > I was able to identify the problem by running something like this:
>> > > > >
>> > > > > sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o trace.txt
>> > > > >
>> > > > > and then graphing the results (log scale). You should try running
>> > > > > this against your servers to confirm that it is indeed the problem.
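[To turn such a trace into numbers without plotting, one can pull the durations that strace -T appends in angle brackets and look at the tail. A small sketch in plain Python; the input format assumed is strace's standard -r -T output as shown above:]

```python
import re

# strace -T appends the wall-clock duration of each syscall in angle
# brackets, e.g.:  0.000123 fdatasync(35)  = 0 <0.002345>
DURATION = re.compile(r'(fsync|fdatasync)\(.*<([\d.]+)>')

def slow_syncs(lines, threshold=1.0):
    """Return (syscall, seconds) pairs slower than `threshold` seconds."""
    hits = []
    for line in lines:
        m = DURATION.search(line)
        if m:
            call, secs = m.group(1), float(m.group(2))
            if secs >= threshold:
                hits.append((call, secs))
    return hits
```

Any fdatasync in the seconds range on the transaction-log device is long enough to blow a typical session timeout on its own.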
>> > > > >
>> > > > > > We were trying to classify requests into two types: either
>> > > > > > heartbeats or normal requests. Wouldn't it be better to reject
>> > > > > > normal requests once the queue fills past a certain threshold,
>> > > > > > but keep the session alive? That way flow control is achieved
>> > > > > > by the user's session retrying the operation, while the
>> > > > > > session's health is maintained.
>> > > > >
>> > > > > What good is a session (connection) that's not usable? You're
>> > > > > better off disconnecting and re-establishing with a server that
>> > > > > can process your requests in a timely fashion.
>> > > > >
>> > > > > ZK looks at availability from a service perspective, not from an
>> > > > > individual session/connection perspective. The whole is more
>> > > > > important than the parts. There is already very sophisticated
>> > > > > flow control going on - e.g. a server stops reading requests
>> > > > > from its sessions when the number of outstanding requests
>> > > > > exceeds some threshold. Once the server catches up it starts
>> > > > > reading again. Again, check your "stat" results for insight into
>> > > > > this (i.e. "outstanding requests").
>> > > > >
>> > > > > Patrick
>> > > > >
>> > > >
>> > >
>> >
>>
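[The throttling Patrick describes - stop reading from client sockets once outstanding requests exceed a limit, resume when the server catches up - can be sketched as a simple gate. A toy illustration of the idea only; plain Python, and the names and threshold are made up, not ZooKeeper's actual code:]

```python
class RequestThrottle:
    """Toy model of server-side backpressure: while the number of
    submitted-but-uncompleted requests exceeds the limit, the server
    stops reading new requests from its connections (sessions stay
    alive); reading resumes once it catches up."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.outstanding = 0

    def can_read(self):
        # Gate checked before pulling the next request off a connection.
        return self.outstanding < self.limit

    def submitted(self):
        self.outstanding += 1

    def completed(self):
        self.outstanding -= 1
```

The point of the design is that backpressure is applied without declaring the session dead: the client simply sees slower reads, not an expired session.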
