cassandra-commits mailing list archives

From "Alain RODRIGUEZ (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
Date Thu, 07 Apr 2016 14:38:25 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230325#comment-15230325 ]

Alain RODRIGUEZ edited comment on CASSANDRA-11363 at 4/7/16 2:37 PM:
---------------------------------------------------------------------

I also observed in C* 2.1.12 that a certain percentage of the Native-Transport-Requests are blocked, yet there is no major CPU or resource issue on my side, so it might not be related.

For what it is worth, here is something I observed about Native-Transport-Requests: increasing the 'native_transport_max_threads' value helps mitigate this, as expected, but the number of blocked Native-Transport-Requests is still non-zero.

{noformat}
[alain~]$ knife ssh "role:cassandra" "nodetool tpstats | grep Native-Transport-Requests" | grep -e server1 -e server2 -e server3 -e server4 | sort | awk 'BEGIN { printf "%50s %10s","Server |"," Blocked ratio:\n" } { printf "%50s %10f%\n", $1, (($7/$5)*100) }'
       Server |  Blocked ratio:
       server3   0.044902%
       server4   0.030127%
       server2   0.045759%
       server1   0.082763%
{noformat}
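
For a single node (without knife), a minimal equivalent sketch, assuming the 2.1-era tpstats column order (Active, Pending, Completed, Blocked, All time blocked) and nodetool on the PATH:

{noformat}
# Blocked ratio for Native-Transport-Requests on the local node:
# $6 is "All time blocked", $4 is "Completed".
nodetool tpstats | awk '/^Native-Transport-Requests/ { printf "%s blocked ratio: %f%%\n", $1, ($6/$4)*100 }'
{noformat}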

I waited long enough between the change and the result capture (many days, probably a few weeks). As all the nodes are in the same datacenter, under a (fairly) balanced load, the comparison is probably relevant.

Here are the results for those nodes, in our use case.

||Server||native_transport_max_threads||Percentage of blocked Native-Transport-Requests||
|server1|128|0.082763%|
|server2|384|0.044902%|
|server3|512|0.045759%|
|server4|1024|0.030127%|
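
For reference, this is the cassandra.yaml setting being compared above; a minimal sketch (the commented-out default in 2.1 is 128, which is what server1 runs, and server3's 512 is shown here as an example):

{noformat}
# cassandra.yaml (per node); uncomment and adjust the value.
# A node restart is needed to pick this up, as far as I know.
native_transport_max_threads: 512
{noformat}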

Also, from the mailing list outputs, it looks like it is quite common to have some Native-Transport-Requests blocked; it is probably unavoidable depending on the network and use cases (spiky workloads?).
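
To check whether the blocking is spiky rather than steady, one simple way (assuming shell access to a node) is to sample the counter over time, for example:

{noformat}
# Print the Native-Transport-Requests line every 60 seconds;
# a sudden jump in the last column ("All time blocked") points at a burst.
while true; do
  date
  nodetool tpstats | grep Native-Transport-Requests
  sleep 60
done
{noformat}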


was (Author: arodrime):
I also observed in C* 2.1.12 that a certain percentage of the Native-Transport-Requests are blocked, yet there is no major CPU or resource issue on my side, so it might not be related.

For what it is worth, here is something I observed about Native-Transport-Requests: increasing the 'native_transport_max_threads' value helps mitigate this, as expected, but the number of blocked Native-Transport-Requests is still non-zero.

{noformat}
[alain@bastion-d3-prod ~]$ knife ssh "role:cassandra" "nodetool tpstats | grep Native-Transport-Requests" | grep -e server1 -e server2 -e server3 -e server4 | sort | awk 'BEGIN { printf "%50s %10s","Server |"," Blocked ratio:\n" } { printf "%50s %10f%\n", $1, (($7/$5)*100) }'
                                          Server |  Blocked ratio:
       ip-172-17-42-105.us-west-2.compute.internal   0.044902%
       ip-172-17-42-107.us-west-2.compute.internal   0.030127%
       ip-172-17-42-114.us-west-2.compute.internal   0.045759%
       ip-172-17-42-116.us-west-2.compute.internal   0.082763%
{noformat}

I waited long enough between the change and the result capture (many days, probably a few weeks). As all the nodes are in the same datacenter, under a (fairly) balanced load, the comparison is probably relevant.

Here are the results for those nodes, in our use case.

||Server||native_transport_max_threads||Percentage of blocked Native-Transport-Requests||
|server1|128|0.082763%|
|server2|384|0.044902%|
|server3|512|0.045759%|
|server4|1024|0.030127%|

Also, from the mailing list outputs, it looks like it is quite common to have some Native-Transport-Requests blocked; it is probably unavoidable depending on the network and use cases (spiky workloads?).

> Blocked NTR When Connecting Causing Excessive Load
> --------------------------------------------------
>
>                 Key: CASSANDRA-11363
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Russell Bradberry
>         Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack
>
>
> When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the machine load increases to very high levels (> 120 on an 8 core machine) and native transport requests get blocked in tpstats.
> I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8.
> The issue does not seem to affect the nodes running 2.1.9.
> The issue seems to coincide with the number of connections OR the number of total requests being processed at a given time (as the latter increases with the former in our system).
> Currently there are between 600 and 800 client connections on each machine and each machine is handling roughly 2000-3000 client requests per second.
> Disabling the binary protocol fixes the issue for this node but isn't a viable option cluster-wide.
> Here is the output from tpstats:
> {code}
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> MutationStage                     0         8        8387821         0                 0
> ReadStage                         0         0         355860         0                 0
> RequestResponseStage              0         7        2532457         0                 0
> ReadRepairStage                   0         0            150         0                 0
> CounterMutationStage             32       104         897560         0                 0
> MiscStage                         0         0              0         0                 0
> HintedHandoff                     0         0             65         0                 0
> GossipStage                       0         0           2338         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> InternalResponseStage             0         0              0         0                 0
> CommitLogArchiver                 0         0              0         0                 0
> CompactionExecutor                2       190            474         0                 0
> ValidationExecutor                0         0              0         0                 0
> MigrationStage                    0         0             10         0                 0
> AntiEntropyStage                  0         0              0         0                 0
> PendingRangeCalculator            0         0            310         0                 0
> Sampler                           0         0              0         0                 0
> MemtableFlushWriter               1        10             94         0                 0
> MemtablePostFlush                 1        34            257         0                 0
> MemtableReclaimMemory             0         0             94         0                 0
> Native-Transport-Requests       128       156         387957        16            278451
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> BINARY                       0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> Attached is the jstack output for both CMS and G1GC.
> Flight recordings are here:
> https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
> https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr
> It is interesting to note that while the flight recording was taking place, the load on the machine went back to healthy levels, and when the flight recording finished the load went back to > 100.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
