From: Wayne Lewis
To: user@cassandra.apache.org
Subject: multiple datacenter with low replication factor - idea for greater flexibility
Date: Wed, 10 Nov 2010 11:43:15 -0800
Message-Id: <4999B487-18CC-44D5-AF74-1C88FBC09DC3@lewisclan.org>

Hello,

We've had Cassandra running in a single production data center for several months and have started detailed planning to add data center fault tolerance.

Our requirements do not appear to be met out-of-the-box by Cassandra. I'd like to share the solution we're planning and hear from others considering similar problems.

We require the following:

1. Two data centers

One is primary, the other a hot standby to be used when the primary fails. Of course Cassandra has no such bias, but as will be seen below this becomes important when considering application latency.

2. No more than 3 copies of data in total

We are storing blob-like objects, and cost per unit of usable storage is closely scrutinized against other solutions, so we want to keep the replication factor low. Two copies will be held in the primary DC and one in the secondary DC, with the corresponding ratio of machines in each DC.

3. Immediate consistency

4. No waiting on the remote data center

The application front end runs in the primary data center and expects that operations using a local coordinator node will not suffer a response time determined by the WAN. Hence we cannot require a response from the node in the secondary data center to achieve quorum.
5. Ability to operate with a single working node per key, if necessary

We wish to be able to operate temporarily with even a single working node per token in desperate situations involving data center failures or combinations of node and data center failure.

Existing Cassandra solutions offer combinations of the above, but it is not at all clear how to achieve all of them without custom work.

Normal quorum with N=3 can tolerate only a single down node regardless of topology. Furthermore, if one node in the primary DC fails, quorum requires synchronous operations over the WAN.

NetworkTopologyStrategy is nice, but requiring quorum in the primary DC with only 2 nodes there means no tolerance of even a single node failure in that DC.

If we're overlooking something I'd love to know.

Hence the following proposal for a new replication strategy we're calling SubQuorum.

In short, SubQuorum allows administratively marking some nodes as exempt from participating in quorum. Because all nodes agree on the exemption status, consistency is still guaranteed: quorum is still achieved amongst the remaining nodes. We gain tremendous flexibility to deal with node and DC failures. Exempt nodes, if up, still receive mutation messages as usual.

For example: if a primary DC node fails, we can mark its remote counterpart exempt from quorum, allowing continued operation without a synchronous call over the WAN.

Another example: if the primary DC fails, we mark all primary DC nodes exempt and move the entire application to the secondary DC, where it runs as usual but with just the one copy.

The implementation is trivial and consists of two pieces:

1. Exempt node management. The list of exempt nodes is broadcast out of band; in our case we're leveraging puppet and an admin server.

2. An implementation of AbstractReplicationStrategy that returns custom QuorumResponseHandler and IWriteResponseHandler implementations. These simply wait for quorum amongst the non-exempt nodes (a rough sketch of the idea is appended below my signature).

This requires a small change to the AbstractReplicationStrategy interface to pass the endpoints to getQuorumResponseHandler and getWriteResponseHandler, but otherwise the changes are contained in the plugin.

There is more analysis I can share if anyone is interested, but at this point I'd like to get feedback.

Thanks,
Wayne Lewis
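
A minimal sketch of the write-side accounting, assuming "quorum amongst the non-exempt nodes" means a simple majority of the non-exempt replicas. The class and method names below (SubQuorumWriteHandler, response, get, timeoutMillis) are illustrative only and do not match the real IWriteResponseHandler/QuorumResponseHandler interfaces; the exemption set is assumed to have already been distributed out of band as described above.

import java.net.InetAddress;
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SubQuorumWriteHandler {
    private final Set<InetAddress> exempt;   // replicas excluded from quorum, agreed cluster-wide
    private final CountDownLatch responses;  // acks still needed from non-exempt replicas
    private final long timeoutMillis;

    public SubQuorumWriteHandler(Collection<InetAddress> endpoints,
                                 Set<InetAddress> exempt,
                                 long timeoutMillis) {
        this.exempt = exempt;
        this.timeoutMillis = timeoutMillis;

        // Quorum is a majority of the non-exempt replicas only. With RF=3 and
        // one replica exempt, that is 2 of 2; with two replicas exempt, 1 of 1.
        int nonExempt = 0;
        for (InetAddress ep : endpoints)
            if (!exempt.contains(ep))
                nonExempt++;
        this.responses = new CountDownLatch(nonExempt / 2 + 1);
    }

    // Called once per replica ack. Exempt replicas still receive the mutation,
    // but their acks do not count toward the quorum the coordinator blocks on.
    public void response(InetAddress from) {
        if (!exempt.contains(from))
            responses.countDown();
    }

    // Block the coordinator until quorum among the non-exempt replicas is
    // reached, or time out.
    public void get() throws InterruptedException, TimeoutException {
        if (!responses.await(timeoutMillis, TimeUnit.MILLISECONDS))
            throw new TimeoutException("quorum of non-exempt replicas not reached");
    }
}

The read side would do the same counting over read responses; because readers and writers use the same exemption list, their majorities overlap on at least one non-exempt replica, which is where the consistency argument above comes from.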