From: Wayne Lewis
To: user@cassandra.apache.org
Subject: multiple datacenter with low replication factor - idea for greater flexibility
Date: Wed, 10 Nov 2010 11:43:15 -0800
Message-Id: <4999B487-18CC-44D5-AF74-1C88FBC09DC3@lewisclan.org>

Hello,

We've had Cassandra running in a single production data center for several months and have started detailed planning to add data center fault tolerance.

Our requirements do not appear to be met out-of-the-box by Cassandra. I'd like to share the solution we're planning and hear from others considering similar problems.

We require the following:

1. Two data centers

One is primary, the other a hot standby to be used when the primary fails. Of course Cassandra has no such bias, but as will be seen below this becomes important when considering application latency.

2. No more than 3 copies of data in total

We are storing blob-like objects, and cost per unit of usable storage is closely scrutinized against other solutions, so we want to keep the replication factor low. Two copies will be held in the primary DC and one in the secondary DC, with the corresponding ratio of machines in each DC.

3. Immediate consistency

4. No waiting on the remote data center

The application front end runs in the primary data center and expects that operations using a local coordinator node will not suffer a response time determined by the WAN. Hence we cannot require a response from the node in the secondary data center to achieve quorum.
5. Ability to operate with a single working node per key, if necessary

We wish to be able to operate temporarily with even a single working node per token in desperate situations involving data center failures or combinations of node and data center failure.

Existing Cassandra solutions offer combinations of the above, but it is not at all clear how to achieve all of them without custom work.

Normal quorum with N=3 can tolerate only a single down node regardless of topology. Furthermore, if one node in the primary DC fails, quorum requires synchronous operations over the WAN.

NetworkTopologyStrategy is nice, but requiring quorum in the primary DC with only 2 nodes there means no tolerance of even a single node failure in that DC.

If we're overlooking something I'd love to know.

Hence the following proposal for a new replication strategy we're calling SubQuorum.

In short, SubQuorum allows administratively marking some nodes as exempt from participating in quorum. Because all nodes agree on the exemption status, consistency is still guaranteed: quorum is still achieved amongst the remaining nodes. We gain tremendous flexibility to deal with node and DC failures. Exempt nodes, if up, still receive mutation messages as usual.

For example: if a primary DC node fails, we can mark its remote counterpart exempt from quorum, allowing continued operation without a synchronous call over the WAN.

Another example: if the primary DC fails, we mark all primary DC nodes exempt and move the entire application to the secondary DC, where it runs as usual but with just the one copy.

The implementation is trivial and consists of two pieces:

1. Exempt node management. The list of exempt nodes is broadcast out of band; in our case we're leveraging puppet and an admin server.

2. An implementation of AbstractReplicationStrategy that returns custom QuorumResponseHandler and IWriteResponseHandler implementations. These simply wait for quorum amongst the non-exempt nodes (a rough sketch of the idea is appended below my signature).

This requires a small change to the AbstractReplicationStrategy interface to pass the endpoints to getQuorumResponseHandler and getWriteResponseHandler, but otherwise the changes are contained in the plugin.

There is more analysis I can share if anyone is interested, but at this point I'd like to get feedback.

Thanks,
Wayne Lewis
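
A minimal sketch of the write-side accounting, assuming "quorum amongst the non-exempt nodes" means a simple majority of the non-exempt replicas. The class and method names below (SubQuorumWriteHandler, response, get, timeoutMillis) are illustrative only and do not match the real IWriteResponseHandler/QuorumResponseHandler interfaces; the exemption set is assumed to have already been distributed out of band as described above.

import java.net.InetAddress;
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SubQuorumWriteHandler {
    private final Set<InetAddress> exempt;   // replicas excluded from quorum, agreed cluster-wide
    private final CountDownLatch responses;  // acks still needed from non-exempt replicas
    private final long timeoutMillis;

    public SubQuorumWriteHandler(Collection<InetAddress> endpoints,
                                 Set<InetAddress> exempt,
                                 long timeoutMillis) {
        this.exempt = exempt;
        this.timeoutMillis = timeoutMillis;

        // Quorum is a majority of the non-exempt replicas only. With RF=3 and
        // one replica exempt, that is 2 of 2; with two replicas exempt, 1 of 1.
        int nonExempt = 0;
        for (InetAddress ep : endpoints)
            if (!exempt.contains(ep))
                nonExempt++;
        this.responses = new CountDownLatch(nonExempt / 2 + 1);
    }

    // Called once per replica ack. Exempt replicas still receive the mutation,
    // but their acks do not count toward the quorum the coordinator blocks on.
    public void response(InetAddress from) {
        if (!exempt.contains(from))
            responses.countDown();
    }

    // Block the coordinator until quorum among the non-exempt replicas is
    // reached, or time out.
    public void get() throws InterruptedException, TimeoutException {
        if (!responses.await(timeoutMillis, TimeUnit.MILLISECONDS))
            throw new TimeoutException("quorum of non-exempt replicas not reached");
    }
}

The read side would do the same counting over read responses; because readers and writers use the same exemption list, their majorities overlap on at least one non-exempt replica, which is where the consistency argument above comes from.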