zookeeper-user mailing list archives

From Flavio Junqueira <...@apache.org>
Subject Re: zookeeper deployment strategy for multi data centers
Date Sat, 04 Jun 2016 14:37:40 GMT
There is no black magic going on with weights or hierarchies. Whatever you do, you can't escape
the curse of the intersection, and I'm sorry for stating the obvious, but that's the
key reason why masking the loss of one data center out of two is so hard.

In general, using majorities makes it simpler to reason about the problem, in particular because
the failure scenarios are uniform. With weights, you can play some tricks, but they do not
necessarily give you a great advantage. For example, say I have 6 servers, I split them between
two data centers, and I assign a weight of 2 to one of them. I can form a quorum in one single
data center despite the fact that I have an equal number of servers in each data center. I
could alternatively have 5 voting members and one observer and get similar behavior. One
difference is that when the server with weight 2 is up, I can tolerate up to 3 nodes
crashing. If the server with weight 2 is down, then I can tolerate only one additional crash.
With the 5V+1O configuration, I can always tolerate two crashed servers.
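To make the arithmetic concrete, here is a small sketch of the weighted-majority rule for that 6-server example. This is illustrative Python, not ZooKeeper's actual QuorumVerifier code; the server names and the helper function are made up:

```python
# Sketch of a weighted-majority quorum check (illustrative only).

def weighted_quorum(weights, alive):
    """A quorum exists if the live servers hold a strict majority of weight."""
    total = sum(weights.values())
    return 2 * sum(weights[s] for s in alive) > total

# 6 servers split across two DCs; a1 in DC A carries weight 2 (total weight 7).
weights = {"a1": 2, "a2": 1, "a3": 1, "b1": 1, "b2": 1, "b3": 1}

# DC A alone holds weight 4 of 7, so it forms a quorum even with DC B down:
print(weighted_quorum(weights, {"a1", "a2", "a3"}))        # True

# With a1 down, only one further crash is survivable:
print(weighted_quorum(weights, {"a2", "a3", "b1", "b2"}))  # True
print(weighted_quorum(weights, {"a2", "a3", "b1"}))        # False
```

With a1 up, any two further weight-1 servers complete the quorum, which is why three crashes are survivable in that case.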

The hierarchical stuff starts making sense for me with more complex scenarios. For example,
say I have three data centers and I put three servers in each. Using a hierarchy, I can tolerate
one data center going down plus one crash in each of the remaining data centers, so I can
make progress with 4 nodes up, even though my total is 9. This works because we pick a majority
of votes from a majority of groups. Note that it is not any 4 servers, though. If one data
center is down, then I can't crash two nodes in one of the remaining data centers. Also, because
we only require 4 votes to form a quorum, we need fewer votes to commit requests, and possibly
fewer cross-DC votes. This last observation matters mostly in failure scenarios, because with
plain majority quorums one also needs only two cross-DC votes to form a quorum in the absence of crashes.
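The "majority of votes from a majority of groups" rule for the 3x3 example can be sketched as follows. Again this is illustrative Python rather than ZooKeeper's QuorumHierarchical class, and the group and server names are invented:

```python
# Hierarchical quorum sketch: a quorum needs a majority of servers alive
# in a majority of the groups (all weights equal here, as in the 3x3 case).

def hierarchical_quorum(groups, alive):
    groups_with_majority = sum(
        1 for members in groups.values()
        if 2 * len(members & alive) > len(members)
    )
    return 2 * groups_with_majority > len(groups)

# Three data centers with three servers each.
groups = {
    "dc1": {"s1", "s2", "s3"},
    "dc2": {"s4", "s5", "s6"},
    "dc3": {"s7", "s8", "s9"},
}

# dc3 down plus one crash in each surviving DC: 4 of 9 servers still suffice.
print(hierarchical_quorum(groups, {"s1", "s2", "s4", "s5"}))  # True

# But not any 4 servers: dc3 down plus two crashes in dc2 is not a quorum.
print(hierarchical_quorum(groups, {"s1", "s2", "s3", "s4"}))  # False
```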

The scenario that has traditionally been a problem, and that's really not new, is the active-passive
one, which we can't really make work transparently. The workaround is to manually reconfigure
servers, knowing that there could be some data loss. Even with reconfiguration in 3.5, we
are still stuck in the case where the active DC goes down, because we need a quorum of the old
configuration for the reconfiguration to succeed.

It is possible that the hierarchical approach makes some sense for some scenarios because
of the following argument. It is not always desirable, but say that you want to replicate
synchronously across data centers. If I use the 5V + 1O configuration with 3V in the active
DC and 2V in the passive DC, I can't guarantee that committed requests will be persisted in
at least one node in the passive DC. In fact, to make that guarantee, we need a whole DC worth
of votes + 1. With the hierarchical approach, we can have two groups, which forces every commit
to be replicated in both groups, but I can tolerate crashes in both groups and guarantee that
updates are synchronously replicated. If the active DC goes down, we can reconfigure
manually from two groups to one group, and since the replication is synchronous, the passive
DC will have all commits.
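To see why the two-group layout forces cross-DC persistence while the flat 5V layout doesn't, here is a sketch. The group sizes and helper names are my own assumptions (three voters per group, so each group tolerates one crash), not a fixed prescription:

```python
def flat_quorum(voters, alive):
    """Plain majority over all voters."""
    return 2 * len(voters & alive) > len(voters)

def two_group_quorum(active, passive, alive):
    """With exactly two groups, a majority of groups means both groups,
    so a quorum needs a majority inside each group."""
    return (2 * len(active & alive) > len(active)
            and 2 * len(passive & alive) > len(passive))

# Flat 5V case from the text: 3 voters active, 2 voters passive.
flat_voters = {"a1", "a2", "a3", "p1", "p2"}
# A commit can be acked entirely inside the active DC:
print(flat_quorum(flat_voters, {"a1", "a2", "a3"}))                 # True

# Two groups of 3: every quorum must include a majority of the passive DC.
active, passive = {"a1", "a2", "a3"}, {"p1", "p2", "p3"}
print(two_group_quorum(active, passive, {"a1", "a2", "a3"}))        # False
print(two_group_quorum(active, passive, {"a1", "a2", "p1", "p2"}))  # True
```

The design trade-off is visible in the last two lines: the two-group rule refuses any quorum that the passive DC hasn't seen, which is exactly the synchronous-replication guarantee, at the cost of needing cross-DC acks on every commit.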

Hopefully this analysis is correct and makes some sense. Cross-DC replication is a fascinating
topic.
> On 03 Jun 2016, at 23:54, Camille Fournier <camille@apache.org> wrote:
> You don't need weights to run this cluster successfully across 2
> datacenters, unless you want to run with 4 live read/write nodes which
> isn't really a recommended setup (we advise odd numbers because running
> with even numbers doesn't generally buy you anything).
> I would probably run 3 voting members, 1 observer if you want to run 4
> nodes. In that setup you can lose any one voting node, and of course the
> observer, and be fine. If you lose 2 voting nodes, whether in the same DC
> or x-DC, you will not be able to continue. But votes only need to be acked
> by any 2 servers to be committed.
> In the case of weights and 4 servers, you will either need to ack both of
> the servers in the weighted datacenter or the 2 in the unweighted DC and
> one in the weighted DC.
> I've actually yet to see the killer app for using hierarchy and weights
> although I'd be interested in hearing about it if someone has an example.
> It's not clear that there's a huge value here unless the observer is
> significantly less effective than a full r/w quorum member which would be
> surprising.
> C
> On Fri, Jun 3, 2016 at 6:33 PM, Dan Benediktson <dbenediktson@twitter.com.invalid> wrote:
>> Weights will at least let you do better: if you weight it, you can make it
>> so that datacenter A will survive even if datacenter B goes down, but not
>> the other way around. While not ideal, it's probably better than the
>> non-weighted alternative. (2, 2, 1, 1) weights might work fairly well - as
>> long as any three machines are up, or both machines in the preferred
>> datacenter, quorum can be achieved.
>> On Fri, Jun 3, 2016 at 3:23 PM, Camille Fournier <camille@apache.org>
>> wrote:
>>> You can't solve this with weights.
>>> On Jun 3, 2016 6:03 PM, "Michael Han" <hanm@cloudera.com> wrote:
>>>> ZK supports more than just the majority quorum rule; there are also
>>>> weights / hierarchy-of-groups based quorums [1]. So probably one can
>>>> assign more weight to one out of two data centers, which can then form a
>>>> weight-based quorum even if the other DC is failing?
>>>> Another idea is, instead of forming a single ZK ensemble across DCs,
>>>> forming multiple ZK ensembles with one ensemble per DC. This solution
>>>> might be applicable for heavy read / light write workloads while
>>>> providing a certain degree of fault tolerance. Some relevant discussions [2].
>>>> [1] https://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html (
>>>> weight.x=nnnnn, group.x=nnnnn[:nnnnn]).
>>>> [2] https://www.quora.com/Has-anyone-deployed-a-ZooKeeper-ensemble-across-data-centers
>>>> On Fri, Jun 3, 2016 at 2:16 PM, Alexander Shraer <shralex@gmail.com>
>>>> wrote:
>>>>>> Is there any settings to override the quorum rule? Would you know the
>>>>>> rationale behind it?
>>>>> The rule comes from a theoretical impossibility saying that you must
>>>>> have n > 2f replicas to tolerate f failures, for any algorithm trying
>>>>> to solve consensus while being able to handle periods of asynchrony
>>>>> (unbounded message delays, processing times, etc.).
>>>>> The earliest proof is probably here:
>>>>> https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-chandra.pdf
>>>>> ZooKeeper is assuming this model, so the bound applies to it.
>>>>> The intuition is what's called a 'partition argument'. Essentially, if
>>>>> only 2f replicas were sufficient, you could arbitrarily divide them
>>>>> into 2 sets of f replicas and create a situation where each set of f
>>>>> must go on independently without coordinating with the other set
>>>>> (split brain) when the links between the two sets are slow (i.e., a
>>>>> network partition), simply because the other set could also be down
>>>>> (the algorithm tolerates f failures) and it can't distinguish the two
>>>>> situations.
>>>>> When n > 2f this can be avoided, since one of the sets will have a
>>>>> majority while the other set won't.
>>>>> The key here is that the links between the two data centers can
>>>>> arbitrarily delay messages, so an automatic 'fail-over' where one data
>>>>> center decides that the other one is down is usually considered
>>>>> unsafe. If in your system you have a reliable way to know that the
>>>>> other data center is really in fact down (this is a synchrony
>>>>> assumption), you could do as Camille suggested and reconfigure the
>>>>> system to only include the remaining data center. This would still be
>>>>> very tricky to do, since this reconfiguration would have to involve
>>>>> manually changing configuration files and rebooting servers, while
>>>>> somehow making sure that you're not losing committed state. So not
>>>>> recommended.
>>>>> On Fri, Jun 3, 2016 at 11:30 PM, Camille Fournier <camille@apache.org>
>>>>> wrote:
>>>>>> 2 servers is the same as 1 server wrt fault tolerance, so yes, you
>>>>>> are correct. If they want fault tolerance, they have to run 3 (or
>>>>>> more).
>>>>>> On Fri, Jun 3, 2016 at 4:25 PM, Shawn Heisey <apache@elyograg.org>
>>>>>> wrote:
>>>>>>> On 6/3/2016 1:44 PM, Nomar Morado wrote:
>>>>>>>> Is there any settings to override the quorum rule? Would you know
>>>>>>>> the rationale behind it? Ideally, you will want to operate the
>>>>>>>> application even if at least one data center is up.
>>>>>>> I do not know if the quorum rule can be overridden, or whether your
>>>>>>> application can tell the difference between a loss of quorum and
>>>>>>> zookeeper going down entirely.  I really don't know anything about
>>>>>>> zookeeper client code or zookeeper internals.
>>>>>>> From what I understand, majority quorum is the only way to be
>>>>>>> *completely* sure that cluster software like SolrCloud or your
>>>>>>> application can handle write operations with confidence that they
>>>>>>> are applied correctly.  If you lose quorum, which will happen if
>>>>>>> only one DC is operational, then your application should go
>>>>>>> read-only.  This is what SolrCloud does.
>>>>>>> I am a committer on the Apache Solr project, and Solr uses zookeeper
>>>>>>> when it is running in SolrCloud mode.  The cloud code is handled by
>>>>>>> other people -- I don't know much about it.
>>>>>>> I joined this list because I wanted to have the ZK devs include
>>>>>>> clarification in the zookeeper documentation -- oddly enough,
>>>>>>> related to the very thing we are discussing.  I wanted to be sure
>>>>>>> that the documentation explicitly mentioned that three servers are
>>>>>>> required for a fault-tolerant setup.  Some SolrCloud users don't
>>>>>>> want to accept this as a fact, and believe that two servers should
>>>>>>> be enough.
>>>>>>> Thanks,
>>>>>>> Shawn
>>>> --
>>>> Cheers
>>>> Michael.
