Date: Wed, 24 Apr 2013 12:49:15 +0000 (UTC)
From: "Jonathan Ellis (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-5509) Decouple Consistency & Durability

    [ https://issues.apache.org/jira/browse/CASSANDRA-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640394#comment-13640394 ]

Jonathan Ellis commented on CASSANDRA-5509:
-------------------------------------------

Up until 0.6, this is how hinted handoff worked. The problem was that one node going down tended to morph into a cascading failure storm, since we were trying to maintain the same number of copies with fewer machines to do it with.

We now have more robust code that rejects writes with UAE if hints are getting backed up, so this is not necessarily a -1, but it's something to be cautious about.
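As a rough illustration of the backpressure Jonathan mentions, the sketch below fails a write fast once the hint backlog passes a threshold. It is only a sketch under assumed names: the counter, the MAX_HINTS_IN_PROGRESS threshold, and the UnavailableException stand-in are invented for the example, not the actual Cassandra implementation.

import java.util.concurrent.atomic.AtomicInteger;

public class HintBackpressure
{
    // Assumed threshold; the real limit and how it is configured are not specified here.
    private static final int MAX_HINTS_IN_PROGRESS = 1024;

    private final AtomicInteger hintsInProgress = new AtomicInteger();

    /**
     * Called before accepting a write that would require hinting a down replica.
     * If the hint backlog is already large, fail fast instead of queueing more work,
     * so one down node does not snowball into a cluster-wide overload.
     */
    public void maybeRejectWrite(int downReplicas) throws UnavailableException
    {
        if (downReplicas > 0 && hintsInProgress.get() > MAX_HINTS_IN_PROGRESS)
            throw new UnavailableException("hint backlog too large; shedding writes");
    }

    public void hintStarted()  { hintsInProgress.incrementAndGet(); }
    public void hintFinished() { hintsInProgress.decrementAndGet(); }

    /** Stand-in for the exception the coordinator would return to the client. */
    public static class UnavailableException extends Exception
    {
        public UnavailableException(String message) { super(message); }
    }
}

The point of the check is the one Jonathan makes: shed load before the surviving nodes take on all the extra copy-maintenance work, rather than after the storm has started.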
> Decouple Consistency & Durability
> ---------------------------------
>
> Key: CASSANDRA-5509
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5509
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Rick Branson
>
> Right now in Cassandra, consistency and durability are intertwined in a way that is unnecessary. In environments where nodes have unreliable local storage, the consistency level of writes must be increased to N+1 to ensure that N host failure(s) don't cause data loss, even if weaker consistency is acceptable. The probability of data loss is also heavily influenced by entropy. For example, if the client chooses a replica as the write coordinator for a CL.ONE write, the risk of losing that data increases substantially. During a node outage, the chance of data loss is elevated for a relatively long time: the entire length of the node outage plus recovery time. The required increase in consistency level has real impact: it creates the potential for availability issues during routine maintenance, since an unlucky node failure can cause writes to start failing. It's also generally considered a best practice that each datacenter has at least 3 replicas of data, even if quorums for consistency are not required, as it's the only way to ensure strong durability in the face of transient inter-DC failures.
>
> While researching, I found a relevant paper that provides some theoretical grounding: http://www.cs.ubc.ca/~geoffrey/papers/durability-sigops04.pdf
>
> I'd like to propose that, in the event of a down replica, the coordinator attempts to achieve RF by distributing "remote hints" to RF - liveReplicaCount non-replica nodes. If the coordinator itself is a non-replica, it would be an acceptable choice for a remote hint as well. This would achieve RF-level durability without the availability penalty of increasing the consistency level. It would also allow decreasing the (global) RF, as RF durability goals could still be achieved during transient inter-DC failures with just RF nodes in each DC instead of RF replicas in each DC. Even better would be if the selection of remote hint nodes respected the replication strategy and was able to achieve cross-rack / cross-DC durability.
>
> While ConsistencyLevel is a pretty overloaded concept at this point, and I think it'd be great to add a DurabilityLevel to each write, I understand that this is likely not pragmatic. Therefore, considering that the CL.TWO and CL.THREE options were added largely for durability reasons, I propose that they be repurposed to support durability goals and remote hinting: they would require 1 replica ACK and CL-1 (replica|hint) ACKs. It also might be desirable to extend the "ANY" option to require multiple hint ACKs, such as CL.ANY_TWO or CL.ANY_THREE, which would support combining very high durability with very high availability. All CLs would benefit, as remote hinting vastly tightens the window of elevated data-loss risk during a node outage, from nodeOutageDuration + recoveryDuration down to the time it takes for the coordinator to distribute remote hints.
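As a rough illustration of the remote-hint proposal quoted above, the sketch below picks RF - liveReplicaCount live non-replica nodes as hint targets and counts (replica|hint) ACKs against the repurposed CL.TWO / CL.THREE rule. Everything here (the Node type, the ring representation, and the method names) is invented for the example; it is a sketch of the idea, not a Cassandra patch.

import java.util.ArrayList;
import java.util.List;

public class RemoteHintSketch
{
    public static class Node
    {
        final String address;
        final boolean isReplica;
        final boolean isAlive;

        Node(String address, boolean isReplica, boolean isAlive)
        {
            this.address = address;
            this.isReplica = isReplica;
            this.isAlive = isAlive;
        }
    }

    /**
     * Choose RF - liveReplicaCount live non-replica nodes to hold "remote hints",
     * so RF copies of the write exist somewhere even while replicas are down.
     */
    public static List<Node> pickRemoteHintTargets(List<Node> ring, int replicationFactor)
    {
        int liveReplicas = 0;
        for (Node n : ring)
            if (n.isReplica && n.isAlive)
                liveReplicas++;

        int hintsNeeded = Math.max(0, replicationFactor - liveReplicas);
        List<Node> targets = new ArrayList<>();
        for (Node n : ring)
        {
            if (targets.size() == hintsNeeded)
                break;
            if (!n.isReplica && n.isAlive)
                targets.add(n); // the coordinator itself would also qualify here
        }
        return targets;
    }

    /**
     * Repurposed CL.TWO / CL.THREE accounting from the proposal:
     * at least one replica ack, plus (cl - 1) further acks from either
     * replicas or remote-hint holders.
     */
    public static boolean ackSatisfied(int cl, int replicaAcks, int hintAcks)
    {
        return replicaAcks >= 1 && (replicaAcks + hintAcks) >= cl;
    }
}

For example, with RF=3, one live replica and two remote-hint acks, ackSatisfied(3, 1, 2) returns true: durability at RF is preserved even though only one replica acknowledged, which is the decoupling the ticket is asking for.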