zookeeper-user mailing list archives

From "FPJ" <fpjunque...@yahoo.com>
Subject RE: ZOOKEEPER-900 / 901 / 1678
Date Wed, 30 Apr 2014 08:09:23 GMT
Hi Cameron,

Which version of ZK are you using? Also, if you can share logs, then it might be easier for
us to help you out.

-Flavio

> -----Original Message-----
> From: Cameron McKenzie [mailto:mckenzie.cam@gmail.com]
> Sent: 30 April 2014 08:44
> To: zookeeper-user@hadoop.apache.org
> Subject: ZOOKEEPER-900 / 901 / 1678
> 
> ZooKeeper users,
> Does anyone know the status of these issues? There doesn't seem to have
> been any activity on them since late 2010.
> 
> I think that we're experiencing the same issue currently. If we have a 3-node
> cluster, for example, and 1 of these nodes is completely dead (i.e., the
> entire host is not contactable due to a power outage), I would expect that a
> quorum could still be formed, but this does not appear to be the case.
> 
> I haven't delved into the code too much, but it appears that blocking IO is
> being used for the connect. This doesn't respect the socket SO_TIMEOUT being
> set, so the connect() call can block for some arbitrary amount of time
> (based on the OS-level TCP settings?). This in turn means that leader
> election will fail because it times out before the socket connect does, even
> though there are enough live hosts present to form a quorum.
> 
> This seems like a fairly fundamental problem, unless I'm missing something.
> If a single host goes down due to a power failure, for example, it can
> prevent any further hosts from joining the cluster. In addition, if enough
> hosts come back online after a power failure to form a quorum, but some
> don't, a quorum may still fail to form.
> cheers
> Cam
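
The socket behaviour described in the quoted message can be illustrated with a minimal Java sketch. This is not ZooKeeper code: the class and method names (ConnectTimeoutSketch, canConnect) and the timeout values are assumptions made up for the example. It only shows the general java.net.Socket behaviour that setSoTimeout() bounds blocking reads, while a bound on connection establishment has to be passed to connect() itself.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class ConnectTimeoutSketch {

    // Hypothetical helper (not part of ZooKeeper): tries to open a TCP
    // connection and gives up after connectTimeoutMs milliseconds.
    static boolean canConnect(String host, int port, int connectTimeoutMs) {
        try (Socket socket = new Socket()) {
            // SO_TIMEOUT only bounds blocking reads on the socket;
            // it does not limit how long connect() may block.
            socket.setSoTimeout(connectTimeoutMs);

            // The two-argument connect() takes an explicit connect timeout.
            // Without it, an attempt to reach a powered-off host can block
            // until the operating system's own TCP timeout expires.
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            return true;
        } catch (IOException e) {
            // Covers SocketTimeoutException (timeout elapsed) as well as
            // ConnectException (for example, connection refused).
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Listen on an ephemeral local port so the example is self-contained.
        try (ServerSocket server = new ServerSocket(0)) {
            boolean reachable = canConnect("127.0.0.1", server.getLocalPort(), 1000);
            System.out.println("reachable = " + reachable);
        }
    }
}
```

With a bounded connect like this, a caller that has to contact several peers in turn can skip an unreachable one after a known delay instead of waiting on the OS-level TCP settings. Whether this matches what the ZooKeeper election code does in a given version would need to be checked against the source and the issues referenced in the subject line.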

