Date: Fri, 2 Jun 2017 04:27:04 +0000 (UTC)
From: "Simon Zhou (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-6908) Dynamic endpoint snitch destabilizes cluster under heavy load

    [ https://issues.apache.org/jira/browse/CASSANDRA-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034112#comment-16034112 ]

Simon Zhou commented on CASSANDRA-6908:
---------------------------------------

We hit a similar issue, so I worked out a simple patch (attached) to decouple the scores for iowait and sampled read latency. From my observation, there are two issues:

1. The iowait score of a node changes frequently, and the gaps among the scores of different nodes are usually far beyond the default 1.1 threshold.
2. The (median) latency scores don't vary much, but some nodes have latency scores of 0, even with the fix for CASSANDRA-13074 (we're running 3.0.13).

Here are the numbers I got (formatted) with my attached patch:

{code}
szhou@host:~$ java -jar cmdline-jmxclient-0.10.3.jar - localhost:7199 org.apache.cassandra.db:type=DynamicEndpointSnitch LatencyScores
06/01/2017 23:30:36 +0000 org.archive.jmx.Client LatencyScores: {
/node1=0.7832167832167832
/node2=0.0
/node3=1.0
/node4=0.0
/node5=0.0
/node6=0.43356643356643354
/node7=0.4825174825174825
/node8=0.0
/node9=0.8881118881118881
/node10=0.0
/node11=0.9440559440559441
/node12=0.0
/node13=0.0
/node14=0.0
/node15=0.0
/node16=0.0}
szhou@host:~$ java -jar cmdline-jmxclient-0.10.3.jar - localhost:7199 org.apache.cassandra.db:type=DynamicEndpointSnitch LatencyScores
06/01/2017 23:30:45 +0000 org.archive.jmx.Client LatencyScores: {/10.165.10.5=0.7832167832167832
/node1=0.0
/node2=1.0
/node3=0.0
/node4=0.0
/node5=0.43356643356643354
/node6=0.4825174825174825
/node7=0.0
/node8=0.8881118881118881
/node9=0.0
/node10=0.9440559440559441
/node11=0.0
/node12=0.0
/node13=0.0
/node15=0.0
/node16=0.0}
szhou@host:~$ java -jar cmdline-jmxclient-0.10.3.jar - localhost:7199 org.apache.cassandra.db:type=DynamicEndpointSnitch IOWaitScores
06/01/2017 23:30:54 +0000 org.archive.jmx.Client IOWaitScores: {
/node1=5.084033489227295
/node2=4.024896621704102
/node3=4.54736852645874
/node4=4.947588920593262
/node5=3.4599156379699707
/node6=4.0653815269470215
/node7=6.989473819732666
/node8=3.371259927749634
/node9=5.800169467926025
/node10=3.2855939865112305
/node11=5.631399154663086
/node12=5.484004974365234
/node13=0.9635525941848755
/node14=1.5043878555297852
/node15=6.481481552124023
/node16=3.751563310623169}
{code}

Yes, we can work around the issue by increasing the badness_threshold, but the problems are:

1. The default threshold doesn't work well.
2. iowait (a percentage) is not a good measurement of end-to-end latency, not only because it changes from second to second, but also because it's a low-level metric that doesn't reflect the whole picture, which should also include GC/safepoint pauses, thread scheduling delays, etc.
3. Instead of using the median read latency, could we use, say, p95 latency as a better factor when calculating scores? I haven't experimented with this yet.

[~brandon.williams] what do you think? [~kohlisankalp] It looks like we have some fixes (or improvements?) in 4.0, but you mentioned in a meeting that DES could be improved. I'd also like to get your ideas on this. I can work on this if we can agree on something.
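For reference, polling these attributes doesn't require the cmdline-jmxclient jar; the sketch below does the same thing with plain JMX from Java. It assumes the LatencyScores/IOWaitScores attributes added by the attached patch (a stock node only exposes the combined Scores attribute), an unauthenticated JMX endpoint on the default port 7199, and a made-up class name, so treat it as an illustration rather than something shipped with Cassandra.

{code}
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper class, not part of Cassandra.
public class SnitchScorePoller
{
    public static void main(String[] args) throws Exception
    {
        String host = args.length > 0 ? args[0] : "localhost";
        // Default Cassandra JMX port; assumes JMX auth/SSL are disabled.
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");

        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName snitch = new ObjectName("org.apache.cassandra.db:type=DynamicEndpointSnitch");

            // LatencyScores and IOWaitScores are the attributes added by the attached patch.
            for (String attr : new String[]{ "LatencyScores", "IOWaitScores" })
            {
                Map<?, ?> scores = (Map<?, ?>) mbs.getAttribute(snitch, attr);
                System.out.println(attr + ":");
                for (Map.Entry<?, ?> e : scores.entrySet())
                    System.out.println("  " + e.getKey() + "=" + e.getValue());
            }
        }
        finally
        {
            connector.close();
        }
    }
}
{code}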
> Dynamic endpoint snitch destabilizes cluster under heavy load
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-6908
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6908
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Configuration
>            Reporter: Bartłomiej Romański
>            Assignee: Brandon Williams
>         Attachments: 0001-Decouple-IO-scores-and-latency-scores-from-DynamicEn.patch, as-dynamic-snitch-disabled.png
>
>
> We observe that with the dynamic snitch disabled our cluster is much more stable than with the dynamic snitch enabled.
> We've got a 15-node cluster with pretty strong machines (2x E5-2620, 64 GB RAM, 2x 480 GB SSD). We mostly do reads (about 300k/s).
> We use Astyanax on the client side with the TOKEN_AWARE option enabled. It automatically directs read queries to one of the nodes responsible for the given token.
> In that case, with the dynamic snitch disabled, Cassandra always handles reads locally. With the dynamic snitch enabled, Cassandra very often decides to proxy the read to some other node. This causes much higher CPU usage and produces much more garbage, which results in more frequent GC pauses (the young generation fills up quicker). By "much higher" and "much more" I mean 1.5-2x.
> I'm aware that a higher dynamic_snitch_badness_threshold value should solve that issue. The default value is 0.1. I've looked at the scores exposed in JMX, and the problem is that our values seem to be completely random. They are usually between 0.5 and 2.0, but change randomly every time I hit refresh.
> Of course, I can set dynamic_snitch_badness_threshold to 5.0 or something like that, but the result will be similar to simply disabling the dynamic snitch altogether (that's what we did).
> I've tried to understand the logic behind these scores and I'm not sure I get the idea...
> It's a sum (without any multipliers) of two components:
> - the ratio of the recent latency of the given node to the recent average node latency
> - something called 'severity', which, if I analyzed the code correctly, is the result of BackgroundActivityMonitor.getIOWait() - the ratio of "iowait" CPU time to total CPU time as reported in /proc/stat (the ratio is multiplied by 100)
> In our case the second value is somewhere around 0-2%, but it varies quite heavily every second.
> What's the idea behind simply adding these two values without any multipliers (e.g. the second one is a percentage while the first one is not)? Are we sure this is the best possible way of calculating the final score?
> Is there a way to force Cassandra to use (much) longer samples? In our case we probably need that to get stable values. The 'severity' is calculated every second. The mean latency is calculated based on some magic, hard-coded values (ALPHA = 0.75, WINDOW_SIZE = 100).
> Am I right that there's no way to tune that without hacking the code?
> I'm aware that there's a dynamic_snitch_update_interval_in_ms property in the config file, but that only determines how often the scores are recalculated, not how long samples are taken. Is that correct?
> To sum up, it would be really nice to have more control over the dynamic snitch behavior, or at least to have the official option to disable it described in the default config file (it took me some time to discover that we can just disable it instead of hacking with dynamic_snitch_badness_threshold=1000).
> Currently, for some scenarios (like ours - an optimized cluster, a token-aware client, heavy load) it causes more harm than good.
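To make the arithmetic questioned in the quoted report concrete, here is a rough sketch of the scoring it describes: a unitless latency ratio and an iowait percentage added together with no weighting. This is an illustration of the described behaviour with made-up node names and numbers, not the actual DynamicEndpointSnitch code.

{code}
import java.util.HashMap;
import java.util.Map;

// Simplified illustration of the scoring described above; NOT the real
// DynamicEndpointSnitch implementation.
public class ScoreSketch
{
    // Constants quoted in the report; the real latency sampling uses them,
    // but they are not modelled in this sketch.
    static final double ALPHA = 0.75;
    static final int WINDOW_SIZE = 100;

    // score(node) = (recent node latency / recent average latency) + severity(node)
    // The first term is a unitless ratio (around 1.0), the second an iowait
    // percentage; they are added with no multipliers.
    static Map<String, Double> scores(Map<String, Double> recentLatencyMs, Map<String, Double> ioWaitPercent)
    {
        double avg = recentLatencyMs.values().stream().mapToDouble(Double::doubleValue).average().orElse(1.0);
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Double> e : recentLatencyMs.entrySet())
        {
            double latencyRatio = e.getValue() / avg;
            double severity = ioWaitPercent.getOrDefault(e.getKey(), 0.0);
            result.put(e.getKey(), latencyRatio + severity);
        }
        return result;
    }

    public static void main(String[] args)
    {
        Map<String, Double> latency = new HashMap<>();
        latency.put("node1", 2.0);
        latency.put("node2", 2.2);
        latency.put("node3", 1.8);
        Map<String, Double> iowait = new HashMap<>();
        iowait.put("node1", 0.5);
        iowait.put("node2", 2.0);
        iowait.put("node3", 0.0);
        // The iowait term dominates: node2 scores ~3.1 versus ~0.9 for node3,
        // even though its measured latency is only ~20% higher.
        scores(latency, iowait).forEach((n, s) -> System.out.println(n + "=" + s));
    }
}
{code}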