Date: Wed, 7 Jul 2010 11:23:17 -0500
Subject: Cluster performance degrades if any single node is slow
From: Mason Hale <mason@onespot.com>
To: user@cassandra.apache.org

We've been experiencing cluster-wide performance issues whenever any single node in the cluster is performing poorly. For example, this occurs when compaction is running on any node in the cluster, or when a new node is being bootstrapped.

We believe the root cause is a performance optimization in Cassandra's read path: for a given request, the "full" data is requested from only a single node in the cluster, and MD5 checksums of the same data are requested from the other replicas (how many depends on the consistency level of the read). The net effect of this optimization is that the read blocks until the data is received from the one node replying with the full data, even if all other nodes respond much more quickly. Thus the entire cluster is only as fast as its slowest node for some fraction of all requests. The portion of requests sent to the slow node is a function of the total cluster size, replication factor, and consistency level. For smallish clusters (e.g., 10 or fewer servers), this degradation can be quite pronounced.

CASSANDRA-981 (https://issues.apache.org/jira/browse/CASSANDRA-981) discusses this issue and proposes dynamically identifying slow nodes and automatically treating them as if they were on a remote network, thus preventing certain performance-critical operations (such as full-data requests) from being routed to them. This seems like a fine solution. However, a design that requires any read operation to wait on the reply from one specific node seems counter to the fundamental design goal of avoiding any single point of failure.
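To make the failure mode concrete, here is a deliberately simplified sketch (plain Java futures standing in for replica replies; this is not Cassandra's actual read path or its classes) of a read where one replica is asked for full data and the others for digests:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class DigestReadSketch {
        // Simulate a replica reply that arrives after latencyMs.
        static Future<byte[]> replicaReply(ExecutorService pool, long latencyMs) {
            return pool.submit(() -> {
                Thread.sleep(latencyMs);
                return new byte[] { 42 };
            });
        }

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newCachedThreadPool();
            Future<byte[]> digestA  = replicaReply(pool, 5);    // fast digest reply
            Future<byte[]> digestB  = replicaReply(pool, 5);    // fast digest reply
            Future<byte[]> fullData = replicaReply(pool, 2000); // slow "data" node

            long start = System.nanoTime();
            // Even though the two digests arrive almost immediately, digests
            // alone cannot satisfy the read; the coordinator must wait for
            // the single full-data reply, however slow that node is.
            fullData.get();
            digestA.get();
            digestB.get();
            System.out.printf("read completed in %d ms%n",
                              (System.nanoTime() - start) / 1_000_000);
            pool.shutdown();
        }
    }

Running this prints a read time of roughly 2000 ms despite two of the three replicas answering in 5 ms, which is exactly the cluster-wide behavior we are seeing.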
In this case, a single node with degraded performance (but still online) can dramatically reduce the overall performance of the cluster. The proposed solution would detect this condition dynamically and take evasive action, but it would require some number of requests to perform poorly before a slow node is identified. It also smells like a complex solution that could have unexpected side effects and edge cases.

I wonder if a simpler solution would be more effective here. In the same way that hinted handoff can now be disabled via configuration, would it be feasible to optionally turn off this optimization? That way I could make the trade-off between the incremental performance improvement from this optimization and more reliable cluster-wide performance.

Ideally, I would be able to configure how many nodes should reply with "full data" for each request. I could then increase this from 1 to 2 to avoid cluster-wide performance degradation when any single node is performing poorly (a rough sketch of what I mean is in the P.S. below). Being able to turn off or tune this setting would also let me do some A/B testing to measure what performance benefit the optimization actually provides.

I'm curious to know whether anyone else has run into this issue, and whether anyone else wishes they could turn off or tune this "full data"/MD5 optimization.

thanks,
Mason
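P.S. To be clear about what I'm asking for, here is a purely hypothetical sketch of how a coordinator might plan a read if the number of full-data replicas were configurable. The `fullDataReplicas` knob does not exist in Cassandra today; the names here are made up for illustration:

    import java.util.Arrays;
    import java.util.List;

    public class ReadPlanSketch {
        // endpoints: live replicas, ordered by proximity (e.g. by the snitch).
        // fullDataReplicas: hypothetical config value; 1 = today's behavior.
        static void planRead(List<String> endpoints, int fullDataReplicas) {
            for (int i = 0; i < endpoints.size(); i++) {
                String request = (i < fullDataReplicas) ? "full data" : "MD5 digest";
                System.out.println(endpoints.get(i) + " <- " + request);
            }
        }

        public static void main(String[] args) {
            List<String> replicas = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3");
            // Raised from 1 to 2, the coordinator could use whichever full
            // reply arrives first, so no single slow replica can stall the
            // read on its own.
            planRead(replicas, 2);
        }
    }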