hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-8748) Be able to accommodate zookeeper going away for a minute or two -- or more
Date Sat, 15 Jun 2013 21:23:19 GMT
stack created HBASE-8748:
----------------------------

             Summary: Be able to accommodate zookeeper going away for a minute or two -- or more
                 Key: HBASE-8748
                 URL: https://issues.apache.org/jira/browse/HBASE-8748
             Project: HBase
          Issue Type: Brainstorming
          Components: Zookeeper
            Reporter: stack


I was talking w/ Christophe Taton yesterday and he asked what happens if zookeeper goes away
for a minute or two -- say, a network or ensemble hiccup of some type.

Unless the ensemble comes back inside the zk session timeout, the cluster will go down.
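
To make the timing concrete, here is a minimal Java sketch of the arithmetic only; it does not talk to a real ensemble. The class and method names are hypothetical. It relies on one ZooKeeper behavior: the server clamps a client's requested session timeout to a range that defaults to 2 x tickTime through 20 x tickTime (minSessionTimeout and maxSessionTimeout, if set, override those bounds). The sessionSurvives helper is a deliberately simplified model of the rule stated above, not of how expiry is actually decided.

```java
/** Sketch only: models session-timeout arithmetic; names are hypothetical. */
final class SessionTimeoutSketch {

    /**
     * Timeout the server would grant, assuming the default bounds of
     * 2 * tickTime (minimum) and 20 * tickTime (maximum).
     */
    static long negotiatedTimeoutMs(long requestedMs, long tickTimeMs) {
        long min = 2 * tickTimeMs;
        long max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    /**
     * Simplified model of the rule above: the session is treated as lost
     * unless the ensemble is reachable again before the timeout elapses.
     */
    static boolean sessionSurvives(long outageMs, long negotiatedMs) {
        return outageMs < negotiatedMs;
    }
}
```

Under these assumptions, with a tickTime of 2000 ms a requested timeout of five minutes is clamped to 40 seconds, so a two-minute outage outlasts the session unless the server-side maximum is raised as well.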

To my knowledge, zk has hiccuped a few times.  There was the bug where sequence numbers rolled
around the top, causing the ensemble to blip (fixed in a newer zk).  There was another event
where <speculation>some combination of a leader election and accumulated log files (>100k)</speculation>
caused the ensemble to blip at SU.

At FB, apparently, the zk session timeout is way up -- > 5 minutes -- in case a top-of-the-rack
switch reboots and partitions the network, separating nodes from the zk ensemble. Rather than
rely on the presence of ephemeral nodes, they depend on heartbeats to determine whether a
regionserver is present (w/ some smarts: if all members of a rack disappear at the same
time, it is not likely they all crashed at the same time).
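
The heartbeat-plus-rack idea can be sketched roughly as below. This is an illustration under stated assumptions, not a description of what FB actually runs: HeartbeatMonitor, its methods, and the rule that a rack needs more than one member before the "whole rack went quiet" exception applies are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: liveness from heartbeats, with a rack-level sanity check. */
final class HeartbeatMonitor {
    private final long timeoutMs;
    // TreeMap keeps servers sorted by name so results are deterministic.
    private final Map<String, String> rackOf = new TreeMap<>();
    private final Map<String, Long> lastBeatMs = new HashMap<>();

    HeartbeatMonitor(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Record a heartbeat from a server located in the given rack. */
    void heartbeat(String server, String rack, long nowMs) {
        rackOf.put(server, rack);
        lastBeatMs.put(server, nowMs);
    }

    private boolean isSilent(String server, long nowMs) {
        return nowMs - lastBeatMs.get(server) > timeoutMs;
    }

    /**
     * Servers to treat as crashed: silent past the timeout, unless every
     * member of a multi-server rack is silent, which is assumed to mean a
     * partition (for example a switch reboot) rather than a mass crash.
     */
    List<String> presumedDead(long nowMs) {
        Map<String, Integer> total = new HashMap<>();
        Map<String, Integer> silent = new HashMap<>();
        for (Map.Entry<String, String> e : rackOf.entrySet()) {
            total.merge(e.getValue(), 1, Integer::sum);
            if (isSilent(e.getKey(), nowMs)) {
                silent.merge(e.getValue(), 1, Integer::sum);
            }
        }
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, String> e : rackOf.entrySet()) {
            int rackSize = total.get(e.getValue());
            int rackSilent = silent.getOrDefault(e.getValue(), 0);
            boolean wholeRackSilent = rackSize > 1 && rackSilent == rackSize;
            if (isSilent(e.getKey(), nowMs) && !wholeRackSilent) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }
}
```

In this sketch a single silent server in an otherwise healthy rack is reported as dead, while a rack that goes quiet all at once is held back. What to do with a held-back rack (wait longer, alert an operator) is left open here.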

I am stating the obvious, I know, but the base presumption that zk will just always be there
is lazy on our part, and we should not be acting as though it were.

Marking this a brainstorming issue because it will need a bit of discussion/design to undo our
current presumption.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
