From: "Jonathan Ellis (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Date: Fri, 2 Oct 2015 14:21:26 +0000 (UTC)
Subject: [jira] [Commented] (CASSANDRA-10244) Replace heartbeats with locally recorded metrics for failure detection

    [ https://issues.apache.org/jira/browse/CASSANDRA-10244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941186#comment-14941186 ]

Jonathan Ellis commented on CASSANDRA-10244:
--------------------------------------------

Would we expect this approach to do better or worse for the scenario described here? http://ucare.cs.uchicago.edu/pdf/socc13-limplock.pdf

{quote}
In our experiments, the client connects to a healthy coordinator, which then forwards requests to other replica nodes where one of them has a degraded NIC. At this point, we only analyze get and put protocols. Based on our initial results, Cassandra's architecture is in general limplock free, and only exhibits 2x slowdown. We are still in the process of crafting more benchmarks to unearth any possible limplock cases.
...
After further diagnosis, we find that the 2x slowdown is due to "flapping", a condition where peers see the slow node dead and up continuously as the node's gossip messages are buried in congestion. Due to flapping, the coordinator's write stage continuously stores and forwards hints. This flapping-induced background work leads to extra work by Java GC, which is the cause of the 2x slowdown.
{quote}

> Replace heartbeats with locally recorded metrics for failure detection
> ----------------------------------------------------------------------
>
>         Key: CASSANDRA-10244
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-10244
>     Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>    Reporter: Jason Brown
>    Assignee: Jason Brown
>
> In the current implementation, the primary purpose of sending gossip messages is to deliver the updated heartbeat value of each node in a cluster. The other data passed in gossip (node metadata such as status, dc, rack, tokens, and so on) changes very infrequently, such that the eventual delivery of that data is entirely reasonable. Heartbeats, however, are quite different: continuous and consistently timed delivery of updated heartbeats is critical to the stability of a cluster. It is through the receipt of updated heartbeats that a node determines the reachability (UP/DOWN status) of all peers in the cluster. The current implementation of FailureDetector measures the time differences between the heartbeat updates received about a peer (note: about a peer, not from the peer directly, as those values are disseminated via gossip). Without consistently timed delivery of those updates, the FD, via its use of the PHI-accrual algorithm, will mark the peer as DOWN (unreachable). The two nodes could be exchanging all other traffic without problem, but if the heartbeats are not propagated correctly, each node will mark the other as DOWN, which is clearly suboptimal for cluster health. Further, heartbeat updates are the only mechanism we use to determine the reachability (UP/DOWN) of a peer; dynamic snitch measurements, for example, are not included in the determination.
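For context, the PHI-accrual detector mentioned above turns the time since the last heartbeat update into a graded suspicion level rather than a binary verdict. Below is a minimal sketch assuming exponentially distributed inter-arrival times; the class, window size, and method names are illustrative, not Cassandra's actual FailureDetector code.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Minimal sketch of a phi-accrual failure detector, assuming
 * exponentially distributed heartbeat inter-arrival times.
 * Names and the window size are illustrative only.
 */
public class PhiAccrualSketch
{
    private static final int WINDOW = 1000;                        // inter-arrival samples to keep
    private static final double PHI_FACTOR = 1.0 / Math.log(10.0); // converts to a base-10 scale

    private final Deque<Long> intervalsMillis = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;

    /** Record that a (gossiped) heartbeat update about this peer arrived. */
    public synchronized void report(long nowMillis)
    {
        if (lastHeartbeatMillis >= 0)
        {
            intervalsMillis.addLast(nowMillis - lastHeartbeatMillis);
            if (intervalsMillis.size() > WINDOW)
                intervalsMillis.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    /** Suspicion level: grows with the gap since the last update, relative to the mean gap. */
    public synchronized double phi(long nowMillis)
    {
        if (intervalsMillis.isEmpty())
            return 0.0;
        double mean = intervalsMillis.stream().mapToLong(Long::longValue).average().orElse(1.0);
        return PHI_FACTOR * (nowMillis - lastHeartbeatMillis) / mean;
    }

    /** Convict (mark the peer DOWN) once phi crosses a configured threshold. */
    public synchronized boolean isAlive(long nowMillis, double phiConvictThreshold)
    {
        return phi(nowMillis) < phiConvictThreshold;
    }
}
{code}

Because phi depends only on when heartbeat updates about a peer arrive, any congestion in gossip delivery directly inflates the suspicion level, which is exactly the fragility (and the flapping in the limplock paper above) that this ticket describes.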
> To illustrate this, in the current implementation, assume a cluster of nodes A, B, and C. A partition starts between nodes A and C (no communication succeeds), but both nodes can communicate with B. As B will get the updated heartbeats from both A and C, it will, via gossip, send those over to the other node. Thus, A thinks C is UP, and C thinks A is UP. Unfortunately, due to the partition between them, all communication between A and C will fail, yet neither node will mark the other as DOWN, because each is receiving, transitively via B, the updated heartbeat about the other. While it is true that the other node is alive, having only transitive knowledge about a peer, and allowing that to be the sole determinant of UP/DOWN reachability status, is not sufficient for a correct and efficiently operating cluster.
> This transitive availability is suboptimal, and I propose we drop the heartbeat concept altogether. Instead, the dynamic snitch should become more intelligent, and its measurements should ultimately become the input for determining the reachability status of each peer (as fed into a revamped FD). As we already capture latencies in the dynamic snitch, we can reasonably extend it to include timeouts/missed responses, and make that the basis for the UP/DOWN decision. Thus we will have more accurate and relevant peer statuses, tailored to the local node.
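To make the proposed direction concrete, one possible shape for such a detector is sketched below. This is not the patch for this ticket; LocalOutcomeDetector, ALPHA, and the threshold parameter are invented for illustration. The idea is simply that request outcomes observed by the local node, in the spirit of the dynamic snitch, feed the UP/DOWN decision directly.

{code:java}
/**
 * Hypothetical sketch: derive a peer's UP/DOWN status from locally
 * observed request outcomes (latency and timeouts) instead of from
 * gossiped heartbeats. All names and thresholds are illustrative.
 */
public class LocalOutcomeDetector
{
    private static final double ALPHA = 0.75;     // EWMA weight given to the newest sample

    private double timeoutRate = 0.0;             // EWMA of outcomes: 0 = success, 1 = timeout
    private double latencyMillis = 0.0;           // EWMA of successful-response latency

    /** Fold in the outcome of a request this node sent to the peer. */
    public synchronized void record(boolean timedOut, double rttMillis)
    {
        timeoutRate = ALPHA * (timedOut ? 1.0 : 0.0) + (1 - ALPHA) * timeoutRate;
        if (!timedOut)
            latencyMillis = ALPHA * rttMillis + (1 - ALPHA) * latencyMillis;
    }

    /** UP/DOWN as seen from *this* node; other nodes may legitimately disagree. */
    public synchronized boolean isUp(double maxTimeoutRate)
    {
        return timeoutRate < maxTimeoutRate;
    }
}
{code}

Note the strictly local verdict: in the A/B/C example above, A and C would each convict the other based on their own failed requests, regardless of the heartbeats B relays transitively.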