Date: Thu, 4 Oct 2018 00:30:00 +0000 (UTC)
From: "Joseph Lynch (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Comment Edited] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off

    [ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637652#comment-16637652 ]

Joseph Lynch edited comment on CASSANDRA-14747 at 10/4/18 12:29 AM:
--------------------------------------------------------------------

[~jasobrown] Ok, I think we found the problem! In the new Netty code we explicitly set the {{SO_SNDBUF}} [of the outbound socket|https://github.com/apache/cassandra/blob/47a10649dadbdea6960836a7c0fe6d271a476204/src/java/org/apache/cassandra/net/async/NettyFactory.java#L332] to 64KB.
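For context, pinning a fixed send buffer on a Netty client bootstrap looks roughly like the following. This is only a minimal sketch of the pattern behind the linked NettyFactory line, not the actual Cassandra code; the class and constant names are illustrative:

{code:java}
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;

public final class OutboundBootstrapSketch
{
    // illustrative constant; the real value comes from the outbound connection parameters
    private static final int SEND_BUFFER_SIZE = 1 << 16; // 64KB

    public static Bootstrap configure(Bootstrap bootstrap)
    {
        // Forcing SO_SNDBUF on every outbound internode socket caps how much
        // unacknowledged data the kernel will keep in flight on that connection.
        return bootstrap.option(ChannelOption.SO_SNDBUF, SEND_BUFFER_SIZE);
    }
}
{code}

Because explicitly setting {{SO_SNDBUF}} disables the kernel's send-buffer autotuning for that socket on Linux, the 64KB cap sticks for the life of the connection.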
A fixed 64KB buffer works great if you have a low latency connection, but for long fat networks this is a serious issue: you restrict your bandwidth significantly due to the high [bandwidth delay product|https://en.wikipedia.org/wiki/Bandwidth-delay_product]. In the tests we've been running, we are trying to push a semi-reasonable amount of traffic (around 8 Mbps) to peers that are about 80ms away (us-east-1 to eu-west-1 is usually about [80ms|https://www.cloudping.co/]). With a 64KB window size we just don't have enough bandwidth, even though the actual link is very high bandwidth.

As we can see using {{iperf}}, setting a 64KB buffer cripples throughput:

{noformat}
# On the eu-west-1 node X
$ iperf -s -p 8080
------------------------------------------------------------
Server listening on TCP port 8080
TCP window size: 12.0 MByte (default)
------------------------------------------------------------
[  4] local X port 8080 connected with Y port 26964
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.5 sec   506 MBytes   404 Mbits/sec
[  5] local X port 8080 connected with Y port 27050
[  5]  0.0-10.5 sec  8.50 MBytes  6.81 Mbits/sec

# On the us-east-1 node Y about 80ms away
$ iperf -N -c X -p 8080
------------------------------------------------------------
Client connecting to X, TCP port 8080
TCP window size: 12.0 MByte (default)
------------------------------------------------------------
[  3] local Y port 26964 connected with X port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec   506 MBytes   421 Mbits/sec

$ iperf -N -w 64K -c X -p 8080
------------------------------------------------------------
Client connecting to X, TCP port 8080
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local Y port 27050 connected with X port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  8.50 MBytes  7.03 Mbits/sec
{noformat}

So instead of Cassandra getting the full link's bandwidth of ~500 Mbps, we're only able to get ~7 Mbps. This is lower than the 8 Mbps we need to push, so the us-east-1 -> eu-west-1 queues effectively grow without bound until we start dropping messages.
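The measured ~7 Mbps lines up with a back-of-the-envelope bandwidth-delay-product estimate. A quick sanity check, assuming the forced 64KB buffer acts as the effective window and the round trip is roughly 80ms:

{code:java}
public class WindowLimitCheck
{
    public static void main(String[] args)
    {
        long windowBytes = 64 * 1024;   // the forced SO_SNDBUF acting as the effective window
        double rttSeconds = 0.080;      // approximate us-east-1 <-> eu-west-1 round trip

        // TCP can keep at most one window of unacknowledged data in flight per round trip
        double maxMbitPerSec = (windowBytes * 8) / rttSeconds / 1_000_000;

        System.out.printf("window-limited throughput ~= %.1f Mbit/s%n", maxMbitPerSec);
        // prints ~6.6 Mbit/s, right in line with the 6.81-7.03 Mbit/s iperf measured above
    }
}
{code}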
I applied a [patch|https://gist.github.com/jolynch/966e0e52f34eff7a7b8ac8d5a9cb4b5d#file-fix-the-problem-diff] which does not set {{SO_SNDBUF}} unless explicitly asked to, and *everything is completely wonderful* now (a rough sketch of the shape of the change is included after the list below). Some ways that things are wonderful:

1. The CPU usage is now on par with 3.0.x, and most of that CPU time is spent in compaction (both in garbage creation and actual CPU time):

{noformat}
2018-10-03T23:56:40.889+0000 Process summary
  process cpu=321.33%
  application cpu=301.46% (user=185.93% sys=115.52%)
  other: cpu=19.88%
  thread count: 274
  GC time=5.27% (young=5.27%, old=0.00%)
  heap allocation rate 478mb/s
  safe point rate: 0.4 (events/s) avg. safe point pause: 135.64ms
  safe point sync time: 0.08% processing time: 5.38% (wallclock time)
[000135] user=49.03% sys=11.84% alloc=  142mb/s - CompactionExecutor:1
[000136] user=44.60% sys=13.81% alloc=  133mb/s - CompactionExecutor:2
[000198] user= 0.00% sys=41.46% alloc=  4833b/s - NonPeriodicTasks:1
[000010] user= 9.56% sys= 0.67% alloc=   57mb/s - spectator-gauge-polling-0
[000029] user= 7.45% sys= 2.13% alloc= 5772kb/s - PerDiskMemtableFlushWriter_0:1
[000036] user= 0.00% sys= 8.98% alloc=  2598b/s - PERIODIC-COMMIT-LOG-SYNCER
[000115] user= 5.74% sys= 2.22% alloc=   12mb/s - MessagingService-NettyInbound-Thread-3-1
[000118] user= 4.03% sys= 3.75% alloc= 2915kb/s - MessagingService-NettyOutbound-Thread-4-3
[000117] user= 3.12% sys= 2.79% alloc= 2110kb/s - MessagingService-NettyOutbound-Thread-4-2
[000144] user= 4.03% sys= 0.92% alloc= 7205kb/s - MutationStage-1
[000146] user= 4.13% sys= 0.77% alloc= 6837kb/s - Native-Transport-Requests-2
[000147] user= 3.12% sys= 1.49% alloc= 6054kb/s - MutationStage-3
[000150] user= 3.22% sys= 1.21% alloc= 6630kb/s - MutationStage-4
[000116] user= 2.72% sys= 1.61% alloc= 1412kb/s - MessagingService-NettyOutbound-Thread-4-1
[000132] user= 2.21% sys= 2.04% alloc=   11mb/s - MessagingService-NettyInbound-Thread-3-2
[000151] user= 2.92% sys= 1.30% alloc= 5462kb/s - Native-Transport-Requests-5
[000134] user= 2.11% sys= 1.71% alloc= 6212kb/s - MessagingService-NettyInbound-Thread-3-4
[000152] user= 3.02% sys= 0.65% alloc= 5357kb/s - MutationStage-6
[000133] user= 1.81% sys= 1.83% alloc= 6236kb/s - MessagingService-NettyInbound-Thread-3-3
[000153] user= 2.32% sys= 1.31% alloc= 4538kb/s - MutationStage-7
{noformat}

The flamegraph is also nice and clean!

2. The latencies for LOCAL_QUORUM are significantly better than 3.0.17. The average latency has improved from 2ms to 1.2ms (40% better!!) and the 99th percentile has come down from close to 70ms to 25ms (60% better!!). I've attached a graph showing the significant latency improvements.

!trunk_vs_3.0.17_latency_under_load.png!

3. Cassandra trunk is no longer dropping any messages!
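For reference, the general shape of that fix, sketched here with a hypothetical {{configuredSendBufferSize}} parameter (the real change is in the linked gist), is to only pin {{SO_SNDBUF}} when an operator explicitly configured a size, and otherwise leave the option unset so the kernel can autotune the send buffer:

{code:java}
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;

final class SendBufferOption
{
    /**
     * Only force SO_SNDBUF when a positive size was explicitly configured.
     * Leaving the option unset lets the kernel autotune the send buffer well
     * past 64KB, which is what long fat networks (high bandwidth-delay
     * product) need to reach the link's actual bandwidth.
     */
    static void apply(Bootstrap bootstrap, int configuredSendBufferSize)
    {
        if (configuredSendBufferSize > 0)
            bootstrap.option(ChannelOption.SO_SNDBUF, configuredSendBufferSize);
    }
}
{code}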
> Evaluate 200 node, compression=none, encryption=none, coalescing=off
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-14747
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14747
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>            Priority: Major
>         Attachments: 3.0.17-QPS.png, 4.0.1-QPS.png, 4.0.11-after-jolynch-tweaks.svg, 4.0.12-after-unconditional-flush.svg, 4.0.15-after-sndbuf-fix.svg, 4.0.7-before-my-changes.svg, 4.0_errors_showing_heap_pressure.txt, 4.0_heap_histogram_showing_many_MessageOuts.txt, i-0ed2acd2dfacab7c1-after-looping-fixes.svg, trunk_vs_3.0.17_latency_under_load.png, ttop_NettyOutbound-Thread_spinning.txt, useast1c-i-0e1ddfe8b2f769060-mutation-flame.svg, useast1e-i-08635fa1631601538_flamegraph_96node.svg, useast1e-i-08635fa1631601538_ttop_netty_outbound_threads_96nodes, useast1e-i-08635fa1631601538_uninlinedcpuflamegraph.0_96node_60sec_profile.svg
>
> Tracks evaluating a 200 node cluster with all internode settings off (no compression, no encryption, no coalescing).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)