Date: Wed, 24 Apr 2013 12:49:15 +0000 (UTC)
From: "Jonathan Ellis (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-5509) Decouple Consistency & Durability

    [ https://issues.apache.org/jira/browse/CASSANDRA-5509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640394#comment-13640394 ]

Jonathan Ellis commented on CASSANDRA-5509:
-------------------------------------------

Up until 0.6, this is how hinted handoff worked. The problem was that one node going down tended to morph into a cascading failure storm, since we were trying to maintain the same number of copies with fewer machines to do it with.

We now have more robust code that rejects writes with UAE if hints are getting backed up, so this is not necessarily a -1, but it's something to be cautious about.
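As a rough illustration of the backpressure Jonathan mentions, the sketch below fails a write fast once the hint backlog passes a threshold. It is only a sketch under assumed names: the counter, the MAX_HINTS_IN_PROGRESS threshold, and the UnavailableException stand-in are invented for the example, not the actual Cassandra implementation.

import java.util.concurrent.atomic.AtomicInteger;

public class HintBackpressure
{
    // Assumed threshold; the real limit and how it is configured are not specified here.
    private static final int MAX_HINTS_IN_PROGRESS = 1024;

    private final AtomicInteger hintsInProgress = new AtomicInteger();

    /**
     * Called before accepting a write that would require hinting a down replica.
     * If the hint backlog is already large, fail fast instead of queueing more work,
     * so one down node does not snowball into a cluster-wide overload.
     */
    public void maybeRejectWrite(int downReplicas) throws UnavailableException
    {
        if (downReplicas > 0 && hintsInProgress.get() > MAX_HINTS_IN_PROGRESS)
            throw new UnavailableException("hint backlog too large; shedding writes");
    }

    public void hintStarted()  { hintsInProgress.incrementAndGet(); }
    public void hintFinished() { hintsInProgress.decrementAndGet(); }

    /** Stand-in for the exception the coordinator would return to the client. */
    public static class UnavailableException extends Exception
    {
        public UnavailableException(String message) { super(message); }
    }
}

The point of the check is the one Jonathan makes: shed load before the surviving nodes take on all the extra copy-maintenance work, rather than after the storm has started.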
> Decouple Consistency & Durability
> ---------------------------------
>
> Key: CASSANDRA-5509
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5509
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Rick Branson
>
> Right now in Cassandra, consistency and durability are intertwined in a way that is unnecessary. In environments where nodes have unreliable local storage, the consistency level of writes must be increased to N+1 to ensure that N host failure(s) don't cause data loss, even if weaker consistency is acceptable. The probability of data loss is also heavily influenced by entropy. For example, if the client chooses a replica as the write coordinator for a CL.ONE write, the risk of losing that data increases substantially. During a node outage, the chance of data loss is elevated for a relatively long time: the entire length of the node outage plus recovery time. The required increase in consistency level has real impact: it creates the potential for availability issues during routine maintenance, since an unlucky node failure can cause writes to start failing. It's also generally considered a best practice that each datacenter has at least 3 replicas of data, even if quorums for consistency are not required, as it's the only way to ensure strong durability in the face of transient inter-DC failures.
>
> While researching, I found a relevant paper that provides some theoretical grounding: http://www.cs.ubc.ca/~geoffrey/papers/durability-sigops04.pdf
>
> I'd like to propose that, in the event of a down replica, the coordinator attempts to achieve RF by distributing "remote hints" to RF - liveReplicaCount non-replica nodes. If the coordinator itself is a non-replica, it would be an acceptable choice for a remote hint as well. This would achieve RF-level durability without the availability penalty of increasing the consistency level. It would also allow decreasing the (global) RF, as RF durability goals could still be achieved during transient inter-DC failures with just RF nodes in each DC instead of RF replicas in each DC. Even better would be if the selection of remote hint nodes respected the replication strategy and was able to achieve cross-rack / cross-DC durability.
>
> While ConsistencyLevel is a pretty overloaded concept at this point, and I think it'd be great to add a DurabilityLevel to each write, I understand that this is likely not pragmatic. Therefore, considering that the CL.TWO and CL.THREE options were added largely for durability reasons, I propose that they be repurposed to support durability goals and remote hinting: they would require 1 replica ACK and CL-1 (replica|hint) ACKs. It also might be desirable to extend the "ANY" option to require multiple hint ACKs, such as CL.ANY_TWO or CL.ANY_THREE, which would support combining very high durability with very high availability. All CLs would benefit, as remote hinting vastly tightens the window of elevated data-loss risk during a node outage, from nodeOutageDuration + recoveryDuration down to the time it takes for the coordinator to distribute remote hints.
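As a rough illustration of the remote-hint proposal quoted above, the sketch below picks RF - liveReplicaCount live non-replica nodes as hint targets and counts (replica|hint) ACKs against the repurposed CL.TWO / CL.THREE rule. Everything here (the Node type, the ring representation, and the method names) is invented for the example; it is a sketch of the idea, not a Cassandra patch.

import java.util.ArrayList;
import java.util.List;

public class RemoteHintSketch
{
    public static class Node
    {
        final String address;
        final boolean isReplica;
        final boolean isAlive;

        Node(String address, boolean isReplica, boolean isAlive)
        {
            this.address = address;
            this.isReplica = isReplica;
            this.isAlive = isAlive;
        }
    }

    /**
     * Choose RF - liveReplicaCount live non-replica nodes to hold "remote hints",
     * so RF copies of the write exist somewhere even while replicas are down.
     */
    public static List<Node> pickRemoteHintTargets(List<Node> ring, int replicationFactor)
    {
        int liveReplicas = 0;
        for (Node n : ring)
            if (n.isReplica && n.isAlive)
                liveReplicas++;

        int hintsNeeded = Math.max(0, replicationFactor - liveReplicas);
        List<Node> targets = new ArrayList<>();
        for (Node n : ring)
        {
            if (targets.size() == hintsNeeded)
                break;
            if (!n.isReplica && n.isAlive)
                targets.add(n); // the coordinator itself would also qualify here
        }
        return targets;
    }

    /**
     * Repurposed CL.TWO / CL.THREE accounting from the proposal:
     * at least one replica ack, plus (cl - 1) further acks from either
     * replicas or remote-hint holders.
     */
    public static boolean ackSatisfied(int cl, int replicaAcks, int hintAcks)
    {
        return replicaAcks >= 1 && (replicaAcks + hintAcks) >= cl;
    }
}

For example, with RF=3, one live replica and two remote-hint acks, ackSatisfied(3, 1, 2) returns true: durability at RF is preserved even though only one replica acknowledged, which is the decoupling the ticket is asking for.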