zookeeper-user mailing list archives

From Michael Han <h...@cloudera.com>
Subject Re: zookeeper deployment strategy for multi data centers
Date Fri, 03 Jun 2016 22:02:56 GMT
ZK supports more than just the majority quorum rule; there are also quorums
based on weights and on hierarchies of groups [1]. So one could probably assign
more weight to one of the two data centers, so that it can form a weight-based
quorum even if the other DC fails?
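
Something like the sketch below (untested; the host names, server IDs and
weights are only illustrative, and the exact semantics of the group/weight
keys are described in [1]) could express that: put all five servers in one
group and give the DC1 servers enough combined weight to form a quorum on
their own.

  # zoo.cfg fragment -- servers 1-3 in DC1, servers 4-5 in DC2
  server.1=dc1-zk1:2888:3888
  server.2=dc1-zk2:2888:3888
  server.3=dc1-zk3:2888:3888
  server.4=dc2-zk1:2888:3888
  server.5=dc2-zk2:2888:3888

  # one group holding all servers; quorum = more than half the total weight
  group.1=1:2:3:4:5
  weight.1=2
  weight.2=2
  weight.3=2
  weight.4=1
  weight.5=1

With a total weight of 8, a quorum should then need more than 4: DC1 alone has
6 and could keep serving if DC2 is unreachable, while DC2 alone (weight 2)
could not.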

Another idea: instead of forming a single ZK ensemble across DCs, form
multiple ZK ensembles, one per DC. This might be applicable for heavy-read /
light-write workloads while still providing a certain degree of fault
tolerance. Some relevant discussion is at [2].
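
Concretely (host names again only illustrative), each DC would run its own
independent 3- or 5-node ensemble, and clients would point at their local one,
e.g. DC1 clients use a connect string like

  dc1-zk1:2181,dc1-zk2:2181,dc1-zk3:2181

while DC2 clients use their own local ensemble, with the application itself
responsible for whatever cross-DC replication or reconciliation it needs.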

[1] https://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html (weight.x=nnnnn, group.x=nnnnn[:nnnnn]).
[2]
https://www.quora.com/Has-anyone-deployed-a-ZooKeeper-ensemble-across-data-centers

On Fri, Jun 3, 2016 at 2:16 PM, Alexander Shraer <shralex@gmail.com> wrote:

> > Are there any settings to override the quorum rule? Would you know the
> > rationale behind it?
>
> The rule comes from a theoretical impossibility result saying that you must
> have n > 2f replicas to tolerate f failures, for any algorithm that tries to
> solve consensus while being able to handle periods of asynchrony (unbounded
> message delays, processing times, etc.). The earliest proof is probably this
> paper:
> <https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-chandra.pdf>.
> ZooKeeper assumes this model, so the bound applies to it.
>
> The intuition is what's called a 'partition argument'. Essentially, if only
> 2f replicas were sufficient, you could arbitrarily divide them into 2 sets of
> f replicas each and create a situation where each set must go on
> independently, without coordinating with the other set (split brain), when
> the links between the two sets are slow (i.e., a network partition). Each set
> has to proceed simply because the other set could also be down (the algorithm
> tolerates f failures), and it cannot distinguish the two situations. When
> n > 2f this can be avoided, since at most one of the sets can have a
> majority, and the set without one will not proceed on its own.
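>
> To make that concrete for the two-DC case (numbers only for illustration):
> with n = 5 servers split 3 + 2 across the DCs, n > 2f gives f = 2, so the
> ensemble tolerates any 2 failed servers -- including losing the whole smaller
> DC. But losing the 3-server DC is 3 failures, and the 2 remaining servers are
> less than a majority of 5, so they cannot safely continue. And no way of
> placing servers across only 2 DCs lets *each* DC hold a majority on its own,
> since any two majorities must overlap.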
>
> The key here is that the links between the two data centers can arbitrarily
> delay messages, so an automatic 'fail-over' where one data center decides
> that the other one is down is usually considered unsafe. If in your system
> you have a reliable way to know that the other data center really is in fact
> down (this is a synchrony assumption), you could do as Camille suggested and
> reconfigure the system to include only the remaining data center. This would
> still be very tricky to do, since the reconfiguration would have to involve
> manually changing configuration files and rebooting servers, while somehow
> making sure that you're not losing committed state. So it is not recommended.
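>
> (As a rough sketch of the kind of manual change meant here, with illustrative
> host names and server IDs, and assuming the 3-server DC is the one that went
> down: on the surviving DC's two servers you would edit the static zoo.cfg
> from
>
>   server.1=dc1-zk1:2888:3888
>   server.2=dc1-zk2:2888:3888
>   server.3=dc1-zk3:2888:3888
>   server.4=dc2-zk1:2888:3888
>   server.5=dc2-zk2:2888:3888
>
> down to only the surviving entries
>
>   server.4=dc2-zk1:2888:3888
>   server.5=dc2-zk2:2888:3888
>
> and restart them -- after somehow making sure that at least one of them has
> the latest committed state.)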
>
>
>
> On Fri, Jun 3, 2016 at 11:30 PM, Camille Fournier <camille@apache.org>
> wrote:
>
> > 2 servers is the same as 1 server wrt fault tolerance, so yes, you are
> > correct. If they want fault tolerance, they have to run 3 (or more).
> >
> > On Fri, Jun 3, 2016 at 4:25 PM, Shawn Heisey <apache@elyograg.org> wrote:
> >
> > > On 6/3/2016 1:44 PM, Nomar Morado wrote:
> > > > Are there any settings to override the quorum rule? Would you know the
> > > > rationale behind it? Ideally, you will want to operate the application
> > > > as long as at least one data center is up.
> > >
> > > I do not know if the quorum rule can be overridden, or whether your
> > > application can tell the difference between a loss of quorum and
> > > zookeeper going down entirely.  I really don't know anything about
> > > zookeeper client code or zookeeper internals.
> > >
> > > From what I understand, majority quorum is the only way to be
> > > *completely* sure that cluster software like SolrCloud or your
> > > application can handle write operations with confidence that they are
> > > applied correctly.  If you lose quorum, which will happen if only one DC
> > > is operational, then your application should go read-only.  This is what
> > > SolrCloud does.
> > >
> > > I am a committer on the Apache Solr project, and Solr uses zookeeper
> > > when it is running in SolrCloud mode.  The cloud code is handled by
> > > other people -- I don't know much about it.
> > >
> > > I joined this list because I wanted to have the ZK devs include a
> > > clarification in the zookeeper documentation -- oddly enough, related to
> > > the very thing we are discussing.  I wanted to be sure that the
> > > documentation explicitly mentioned that three servers are required for a
> > > fault-tolerant setup.  Some SolrCloud users don't want to accept this as
> > > a fact, and believe that two servers should be enough.
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
>



-- 
Cheers
Michael.
