zookeeper-user mailing list archives

From Cameron McKenzie <mckenzie....@gmail.com>
Subject ZOOKEEPER-900 / 901 / 1678
Date Wed, 30 Apr 2014 07:43:34 GMT
ZooKeeper users,
Does anyone know the status of these issues? They don't seem to have had
anything done to them since late 2010.

I think that we're experiencing the same issue currently. If we have a
3-node cluster, for example, and 1 of these nodes is completely dead (i.e. the
entire host is not contactable due to a power outage), I would expect that
a quorum could still be formed from the other 2, but this does not appear to
be the case.
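
To spell out the expectation above, here is a minimal sketch of the majority arithmetic, assuming the default majority-quorum rule (more than half of the ensemble). The class and method names are made up for illustration and are not ZooKeeper's own code.

```java
public class QuorumSketch {

    // Smallest number of servers that is strictly more than half the ensemble.
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // True when the live servers alone are enough to form a majority quorum.
    static boolean hasQuorum(int ensembleSize, int liveServers) {
        return liveServers >= majority(ensembleSize);
    }

    public static void main(String[] args) {
        // 3-node ensemble with 1 host powered off: 2 live servers remain.
        System.out.println("majority of 3 = " + majority(3));
        System.out.println("quorum with 2 of 3 live = " + hasQuorum(3, 2));
    }
}
```

So with 2 of 3 servers alive, the arithmetic says a quorum should be possible, which is why the behaviour we see is surprising.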

I haven't delved into the code too much, but it appears that blocking I/O is
being used for the connect. The connect doesn't respect the SO_TIMEOUT set on
the socket, so the connect() call can block for some arbitrary amount of time
(based on the OS-level TCP settings?). This in turn means that leader
election will fail because it times out before the socket connect does, even
though there are enough live hosts present to form a quorum.
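
To illustrate the distinction described above, here is a rough sketch in plain java.net, not taken from the ZooKeeper source. In Java, SO_TIMEOUT (setSoTimeout) only bounds blocking reads, whereas the two-argument Socket.connect(SocketAddress, int) puts an explicit upper bound on the connect itself. The class name, the helper canConnect, and the default port 3888 and 5 second timeout in main are assumptions chosen for the example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectTimeoutSketch {

    // Hypothetical helper: tries to reach a peer and gives up after
    // connectTimeoutMs instead of waiting on the OS-level TCP timeout.
    static boolean canConnect(String host, int port, int connectTimeoutMs) {
        try (Socket socket = new Socket()) {
            // SO_TIMEOUT bounds blocking reads only; it does not bound connect().
            socket.setSoTimeout(connectTimeoutMs);
            // The timeout argument here is what bounds the connect attempt.
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            return true;
        } catch (IOException e) {
            // Covers SocketTimeoutException, connection refused, unknown host, etc.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed defaults for illustration; pass host and port as arguments.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 3888;
        System.out.println("reachable within 5s: " + canConnect(host, port, 5000));
    }
}
```

With a bounded connect like this, an attempt against a powered-off host would fail after the chosen timeout rather than after whatever the OS decides, which should leave time to reach the remaining live peers.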

This seems like a fairly fundamental problem, unless I'm missing something.
If a single host goes down due to a power failure, for example, it can
prevent any further hosts from joining the cluster. In addition, if enough
hosts come back online after a power failure to form a quorum, but some
don't, a quorum may still not be able to form.
cheers
Cam
