cassandra-commits mailing list archives

From "Joseph Lynch (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off
Date Thu, 04 Oct 2018 00:39:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637652#comment-16637652 ]

Joseph Lynch edited comment on CASSANDRA-14747 at 10/4/18 12:38 AM:
--------------------------------------------------------------------

[~jasobrown] Ok, I think we found the problem! In the new Netty code we explicitly set the
{{SO_SNDBUF}} [of the outbound socket|https://github.com/apache/cassandra/blob/47a10649dadbdea6960836a7c0fe6d271a476204/src/java/org/apache/cassandra/net/async/NettyFactory.java#L332]
to 64KB. This works great on a low latency connection, but on long fat networks it is a
serious issue because it significantly restricts your bandwidth due to the high [bandwidth
delay product|https://en.wikipedia.org/wiki/Bandwidth-delay_product]. In the tests we've been
running, we try to push a semi-reasonable amount of traffic (around 8Mbps) to peers that are
about 80ms away (us-east-1 to eu-west-1 is usually about [80ms|https://www.cloudping.co/]).
With a 64KB window we simply don't get enough throughput, even though the underlying link
has plenty of bandwidth. As {{iperf}} shows, setting a 64KB buffer cripples throughput:
{noformat}
# On the eu-west-1 node X
$ iperf -s -p 8080
------------------------------------------------------------
Server listening on TCP port 8080
TCP window size: 12.0 MByte (default)
------------------------------------------------------------
[  4] local X port 8080 connected with Y port 26964
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.5 sec   506 MBytes   404 Mbits/sec
[  5] local X port 8080 connected with Y port 27050
[  5]  0.0-10.5 sec  8.50 MBytes  6.81 Mbits/sec

# On the us-east-1 node Y about 80ms away

$ iperf -N -c X -p 8080
------------------------------------------------------------
Client connecting to X, TCP port 8080
TCP window size: 12.0 MByte (default)
------------------------------------------------------------
[  3] local Y port 26964 connected with X port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec   506 MBytes   421 Mbits/sec

$ iperf -N -w 64K -c X -p 8080
------------------------------------------------------------
Client connecting to X, TCP port 8080
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local Y port 27050 connected with X port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  8.50 MBytes  7.03 Mbits/sec
{noformat}
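The iperf numbers line up with a quick bandwidth-delay-product estimate. A rough, illustrative sketch (not part of the patch): a TCP connection can have at most one send buffer's worth of data in flight per round trip, so throughput is capped at roughly window / RTT:

```python
# Back-of-envelope TCP throughput cap: at most one send-buffer's worth
# of unacknowledged data can be in flight per round trip.

def max_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on TCP throughput in Mbit/s for a given window and RTT."""
    return window_bytes * 8 / rtt_seconds / 1_000_000

# 64 KiB SO_SNDBUF over the ~80ms us-east-1 -> eu-west-1 path:
print(f"{max_throughput_mbps(64 * 1024, 0.080):.2f} Mbit/s")   # ~6.55 Mbit/s

# Same path with the 12 MB default window:
print(f"{max_throughput_mbps(12 * 1024 * 1024, 0.080):.0f} Mbit/s")
```

The ~6.55 Mbit/s bound is right in line with the ~7 Mbit/s iperf measured with {{-w 64K}}, while the default 12MB window leaves the link itself as the bottleneck.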

So instead of getting the full link bandwidth of ~500Mbps, Cassandra is only able to get about
7Mbps. That is lower than the 8Mbps we need to push, so the us-east-1 -> eu-west-1 queues
effectively grow without bound until we start dropping messages.

I applied a [patch|https://gist.github.com/jolynch/966e0e52f34eff7a7b8ac8d5a9cb4b5d#file-fix-the-problem-diff]
which does not set {{SO_SNDBUF}} unless explicitly asked to, and *everything is completely
wonderful* now. Some ways that things are wonderful:

1. The CPU usage is now on par with 3.0.x, and most of that CPU time is spent in compaction
(both in garbage creation and actual CPU time):

{noformat}
$ sjk  ttop -p $(pgrep -f Cassandra) -n 20 -o CPU
2018-10-03T23:56:40.889+0000 Process summary 
  process cpu=321.33%
  application cpu=301.46% (user=185.93% sys=115.52%)
  other: cpu=19.88% 
  thread count: 274
  GC time=5.27% (young=5.27%, old=0.00%)
  heap allocation rate 478mb/s
  safe point rate: 0.4 (events/s) avg. safe point pause: 135.64ms
  safe point sync time: 0.08% processing time: 5.38% (wallclock time)
[000135] user=49.03% sys=11.84% alloc=  142mb/s - CompactionExecutor:1
[000136] user=44.60% sys=13.81% alloc=  133mb/s - CompactionExecutor:2
[000198] user= 0.00% sys=41.46% alloc=  4833b/s - NonPeriodicTasks:1
[000010] user= 9.56% sys= 0.67% alloc=   57mb/s - spectator-gauge-polling-0
[000029] user= 7.45% sys= 2.13% alloc= 5772kb/s - PerDiskMemtableFlushWriter_0:1
[000036] user= 0.00% sys= 8.98% alloc=  2598b/s - PERIODIC-COMMIT-LOG-SYNCER
[000115] user= 5.74% sys= 2.22% alloc=   12mb/s - MessagingService-NettyInbound-Thread-3-1
[000118] user= 4.03% sys= 3.75% alloc= 2915kb/s - MessagingService-NettyOutbound-Thread-4-3
[000117] user= 3.12% sys= 2.79% alloc= 2110kb/s - MessagingService-NettyOutbound-Thread-4-2
[000144] user= 4.03% sys= 0.92% alloc= 7205kb/s - MutationStage-1
[000146] user= 4.13% sys= 0.77% alloc= 6837kb/s - Native-Transport-Requests-2
[000147] user= 3.12% sys= 1.49% alloc= 6054kb/s - MutationStage-3
[000150] user= 3.22% sys= 1.21% alloc= 6630kb/s - MutationStage-4
[000116] user= 2.72% sys= 1.61% alloc= 1412kb/s - MessagingService-NettyOutbound-Thread-4-1
[000132] user= 2.21% sys= 2.04% alloc=   11mb/s - MessagingService-NettyInbound-Thread-3-2
[000151] user= 2.92% sys= 1.30% alloc= 5462kb/s - Native-Transport-Requests-5
[000134] user= 2.11% sys= 1.71% alloc= 6212kb/s - MessagingService-NettyInbound-Thread-3-4
[000152] user= 3.02% sys= 0.65% alloc= 5357kb/s - MutationStage-6
[000133] user= 1.81% sys= 1.83% alloc= 6236kb/s - MessagingService-NettyInbound-Thread-3-3
[000153] user= 2.32% sys= 1.31% alloc= 4538kb/s - MutationStage-7
{noformat}

The flamegraph is also nice and clean!  [^4.0.15-after-sndbuf-fix.svg] 

2. The latencies for LOCAL_QUORUM are significantly better than 3.0.17. The average latency
has improved from 2ms to 1.2ms (40% better!!) and the 99th percentile has dropped from close
to 70ms to 25ms (60% better!!). I've attached a graph showing the significant latency improvements:
[^trunk_vs_3.0.17_latency_under_load.png] 

3. Cassandra trunk is no longer dropping any messages!
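The idea behind the fix can be sketched with plain sockets (illustrative only; the actual patch is against Cassandra's Netty bootstrap, and {{make_outbound_socket}} is a hypothetical helper, not Cassandra code): only override {{SO_SNDBUF}} when the operator explicitly configured a value, and otherwise leave the kernel's TCP send-buffer autotuning in charge.

```python
# Sketch: set SO_SNDBUF only on explicit operator request; otherwise let
# the kernel autotune the send buffer (which handles high-BDP paths).
import socket
from typing import Optional

def make_outbound_socket(send_buffer_bytes: Optional[int] = None) -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if send_buffer_bytes is not None and send_buffer_bytes > 0:
        # Explicit override; note Linux reports roughly double the
        # requested value to account for bookkeeping overhead.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_buffer_bytes)
    return sock

autotuned = make_outbound_socket()            # kernel default, grows as needed
pinned = make_outbound_socket(64 * 1024)      # fixed 64KB, caps throughput on fat links
print(autotuned.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print(pinned.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```

The doubling the kernel reports is the same effect visible in the iperf warning above ("requested 64.0 KByte", window size 128 KByte).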



> Evaluate 200 node, compression=none, encryption=none, coalescing=off 
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-14747
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14747
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>            Priority: Major
>         Attachments: 3.0.17-QPS.png, 4.0.1-QPS.png, 4.0.11-after-jolynch-tweaks.svg,
4.0.12-after-unconditional-flush.svg, 4.0.15-after-sndbuf-fix.svg, 4.0.7-before-my-changes.svg,
4.0_errors_showing_heap_pressure.txt, 4.0_heap_histogram_showing_many_MessageOuts.txt, i-0ed2acd2dfacab7c1-after-looping-fixes.svg,
trunk_vs_3.0.17_latency_under_load.png, ttop_NettyOutbound-Thread_spinning.txt, useast1c-i-0e1ddfe8b2f769060-mutation-flame.svg,
useast1e-i-08635fa1631601538_flamegraph_96node.svg, useast1e-i-08635fa1631601538_ttop_netty_outbound_threads_96nodes,
useast1e-i-08635fa1631601538_uninlinedcpuflamegraph.0_96node_60sec_profile.svg
>
>
> Tracks evaluating a 200 node cluster with all internode settings off (no compression,
no encryption, no coalescing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org

