Date: Wed, 7 Jul 2010 11:23:17 -0500
Subject: Cluster performance degrades if any single node is slow
From: Mason Hale <mason@onespot.com>
To: user@cassandra.apache.org

We've been experiencing cluster-wide performance issues whenever any single node in the cluster is performing poorly. For example, this occurs when compaction is running on any node in the cluster, or when a new node is being bootstrapped.

We believe the root cause is a performance optimization in Cassandra's read path: for a given request, the "full" data is requested from only a single node in the cluster, and MD5 checksums of the same data are requested from the other replicas (how many depends on the consistency level of the read). The net effect of this optimization is that the read blocks until the data is received from the one node replying with the full data, even if all other nodes respond much more quickly. Thus the entire cluster is only as fast as its slowest node for some fraction of all requests. The portion of requests sent to the slow node is a function of the total cluster size, replication factor, and consistency level. For smallish clusters (e.g., 10 or fewer servers), this degradation can be quite pronounced.

CASSANDRA-981 (https://issues.apache.org/jira/browse/CASSANDRA-981) discusses this issue and proposes dynamically identifying slow nodes and automatically treating them as if they were on a remote network, thus preventing certain performance-critical operations (such as full-data requests) from being routed to them. This seems like a fine solution. However, a design that requires any read operation to wait on the reply from one specific node seems counter to the fundamental design goal of avoiding any single point of failure.
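To make the failure mode concrete, here is a deliberately simplified sketch (plain Java futures standing in for replica replies; this is not Cassandra's actual read path or its classes) of a read where one replica is asked for full data and the others for digests:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class DigestReadSketch {
        // Simulate a replica reply that arrives after latencyMs.
        static Future<byte[]> replicaReply(ExecutorService pool, long latencyMs) {
            return pool.submit(() -> {
                Thread.sleep(latencyMs);
                return new byte[] { 42 };
            });
        }

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newCachedThreadPool();
            Future<byte[]> digestA  = replicaReply(pool, 5);    // fast digest reply
            Future<byte[]> digestB  = replicaReply(pool, 5);    // fast digest reply
            Future<byte[]> fullData = replicaReply(pool, 2000); // slow "data" node

            long start = System.nanoTime();
            // Even though the two digests arrive almost immediately, digests
            // alone cannot satisfy the read; the coordinator must wait for
            // the single full-data reply, however slow that node is.
            fullData.get();
            digestA.get();
            digestB.get();
            System.out.printf("read completed in %d ms%n",
                              (System.nanoTime() - start) / 1_000_000);
            pool.shutdown();
        }
    }

Running this prints a read time of roughly 2000 ms despite two of the three replicas answering in 5 ms, which is exactly the cluster-wide behavior we are seeing.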
In this case, a single node with degraded performance (but still online) can dramatically reduce the overall performance of the cluster. The proposed solution would detect this condition dynamically and take evasive action, but it would require some number of requests to perform poorly before a slow node is identified. It also smells like a complex solution that could have unexpected side effects and edge cases.

I wonder if a simpler solution would be more effective here. In the same way that hinted handoff can now be disabled via configuration, would it be feasible to optionally turn off this optimization? That way I could make the trade-off between the incremental performance improvement from this optimization and more reliable cluster-wide performance.

Ideally, I would be able to configure how many nodes should reply with "full data" for each request. I could then increase this from 1 to 2 to avoid cluster-wide performance degradation when any single node is performing poorly (a rough sketch of what I mean is in the P.S. below). Being able to turn off or tune this setting would also let me do some A/B testing to measure what performance benefit the optimization actually provides.

I'm curious to know whether anyone else has run into this issue, and whether anyone else wishes they could turn off or tune this "full data"/MD5 optimization.

thanks,
Mason
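P.S. To be clear about what I'm asking for, here is a purely hypothetical sketch of how a coordinator might plan a read if the number of full-data replicas were configurable. The `fullDataReplicas` knob does not exist in Cassandra today; the names here are made up for illustration:

    import java.util.Arrays;
    import java.util.List;

    public class ReadPlanSketch {
        // endpoints: live replicas, ordered by proximity (e.g. by the snitch).
        // fullDataReplicas: hypothetical config value; 1 = today's behavior.
        static void planRead(List<String> endpoints, int fullDataReplicas) {
            for (int i = 0; i < endpoints.size(); i++) {
                String request = (i < fullDataReplicas) ? "full data" : "MD5 digest";
                System.out.println(endpoints.get(i) + " <- " + request);
            }
        }

        public static void main(String[] args) {
            List<String> replicas = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3");
            // Raised from 1 to 2, the coordinator could use whichever full
            // reply arrives first, so no single slow replica can stall the
            // read on its own.
            planRead(replicas, 2);
        }
    }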